The tree module#
This module defines the tree structure a text is parsed into.
A tree consists of Context and Token objects. (Both inherit from the base class Node, which defines the shared methods and properties.)
A Context is a list containing Tokens and other Contexts. A Context is created when a lexicon becomes active. A Context knows its parent Context and its lexicon.
A Token represents one parsed piece of text. A Token is created when a rule in the lexicon matches. A Token knows its parent Context, its position in the text and the action that was specified in the rule.
A Context is always non-empty, except for the root Context, which represents the root lexicon and can be empty if the document did not generate a single token.
The tree structure is easy to navigate; no special objects or iterators are necessary. To find a token at a certain position in a context, use Context.find_token() and its relatives. From every node you can iterate forward() and backward(). Use methods like left_siblings() and right_siblings() to traverse the current context.
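The relationships described above can be pictured with a minimal sketch. This is not the module's implementation: the class bodies, the string stand-ins for lexicons and actions, and the way the weak reference is stored are all assumptions made for illustration.

```python
import weakref


class Node:
    """Shared behaviour; the parent is held through a weak reference."""
    is_token = False
    is_context = False
    _parent = None

    @property
    def parent(self):
        # Dereference the weak reference; None for the root node.
        return self._parent() if self._parent is not None else None

    @parent.setter
    def parent(self, parent):
        self._parent = weakref.ref(parent) if parent is not None else None


class Token(Node):
    is_token = True

    def __init__(self, parent, pos, text, action):
        self.parent = parent
        self.pos, self.text, self.action = pos, text, action


class Context(Node, list):
    is_context = True

    def __init__(self, lexicon, parent):
        list.__init__(self)
        self.lexicon = lexicon
        self.parent = parent


# Hypothetical lexicon and action names, used only as labels here.
root = Context("root-lexicon", None)
root.append(Token(root, 0, "a", "Name"))
sub = Context("string-lexicon", root)
root.append(sub)
sub.append(Token(sub, 2, '"x"', "String"))

# is_token / is_context avoid isinstance() checks while iterating.
kinds = ["token" if node.is_token else "context" for node in root]
```

Because a Context is itself a list, plain indexing and iteration are enough to walk the tree.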
- class Node[source]#
Bases: object
Methods that are shared by Token and Context.
- is_token = False#
- is_context = False#
- property parent#
The parent Context (or None; uses a weak reference).
- dump(file=None, style=None, depth=0)[source]#
Display a graphical representation of the node and its contents.
The file object defaults to stdout, and the style to “round”. You can choose any style that’s in the DUMP_STYLES dictionary.
- property pwd#
Show the ancestry, for debugging purposes.
- parent_index()[source]#
Return our index in the parent.
This is preferred over parent.index(self): this method finds our index using a binary search on position, while parent.index(self) performs a linear search, which becomes slower as the number of children grows.
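The idea of searching on position can be sketched as follows. This is an illustration only, assuming that siblings are sorted by a unique start position; the stand-in nodes and the free function are hypothetical and do not mirror the module's code.

```python
from types import SimpleNamespace


def parent_index(node, siblings):
    """Return the index of node in siblings using binary search on pos."""
    pos = node.pos
    lo, hi = 0, len(siblings)
    while lo < hi:
        mid = (lo + hi) // 2
        if siblings[mid].pos < pos:
            lo = mid + 1
        else:
            hi = mid
    # Assumption: start positions are sorted and unique among siblings.
    if lo < len(siblings) and siblings[lo] is node:
        return lo
    raise ValueError("node is not a child of this parent")


# Stand-in nodes that only carry a start position.
siblings = [SimpleNamespace(pos=p) for p in (0, 4, 9, 15, 22)]
idx = parent_index(siblings[3], siblings)
```

The search needs O(log n) comparisons, where a linear scan needs O(n).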
- is_last()[source]#
Return True if this Node is the last child of its parent.
Fails if called on the root element.
- is_first()[source]#
Return True if this Node is the first child of its parent.
Fails if called on the root element.
- ancestors(upto=None)[source]#
Climb the tree up over the parents.
If upto is given and it is one of the ancestors, stop after yielding that ancestor. Otherwise iteration stops at the root node.
- ancestors_with_index(upto=None)[source]#
Yield the ancestors(upto), and the index of each node in the parent.
- left_sibling()[source]#
Return the left sibling of this node, if any.
Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.
- right_sibling()[source]#
Return the right sibling of this node, if any.
Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.
- left_siblings()[source]#
Yield the left siblings of this node in reverse order, if any.
Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.
- right_siblings()[source]#
Yield the right siblings of this node, if any.
Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.
- forward(upto=None)[source]#
Yield all Tokens in forward direction, starting at the right sibling.
Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.
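A forward traversal of this kind could be written roughly like the sketch below. The Tok and Ctx classes and the free functions are hypothetical stand-ins (with plain strong parent references) and are not the module's implementation.

```python
class Tok:
    is_token = True

    def __init__(self, text):
        self.text, self.parent = text, None


class Ctx(list):
    is_token = False
    parent = None

    def add(self, *nodes):
        for node in nodes:
            node.parent = self
            self.append(node)
        return self


def tokens(ctx):
    """Yield all tokens in ctx, descending into nested contexts."""
    for node in ctx:
        if node.is_token:
            yield node
        else:
            yield from tokens(node)


def forward(node, upto=None):
    """Yield tokens to the right of node, climbing until upto or the root."""
    while node.parent is not None:
        parent = node.parent
        index = next(i for i, n in enumerate(parent) if n is node)
        for sibling in parent[index + 1:]:
            if sibling.is_token:
                yield sibling
            else:
                yield from tokens(sibling)
        if parent is upto:
            break
        node = parent


a, b, c, d = (Tok(s) for s in "abcd")
inner = Ctx().add(b, c)
root = Ctx().add(a, inner, d)

after_b = [t.text for t in forward(b)]
inside_only = [t.text for t in forward(b, upto=inner)]
```

Here after_b contains the texts of c and d, while inside_only stops at the end of the inner context and contains only c.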
- class Token(parent, pos, text, action)[source]#
Bases: Node
A Token instance represents a lexed piece of text.
When a pattern rule in a lexicon matches the text, a Token is created. When that rule would create more than one Token from a single regular expression match, GroupToken objects are created instead, each carrying its index within the group in the group attribute. For normal tokens, the group attribute is read-only and always None.
GroupTokens are thus always adjacent in the same context. If you want to retokenize text starting at some position, be sure you are at the start of a grouped token, e.g.:
t = ctx.find_token(45)
if t.group:
    # t is inside a group: walk left to its first token (group index 0)
    for t in t.left_siblings():
        if not t.group:
            break
pos = t.pos
Alternatively, you can use the GroupToken.get_group_* methods.
(A GroupToken is otherwise just a normal Token. The subclass exists because the group attribute is unused in the vast majority of tokens, and this way it uses no memory in those tokens. You never need to reference the GroupToken class; just test the group attribute if you want to know whether a token belongs to a group that originated from a single match.)
When iterating over the children of a Context (which may be Context or Token instances), you can use the is_token attribute to determine whether the child node is a token, which is easier than calling isinstance(t, Token) each time.
From a token, you can iterate forward() or backward() to find adjacent tokens. If you only want to stay in the current context, use the various sibling methods, such as right_sibling().
By traversing the ancestors() of a token or context, you can find which lexicons created the tokens.
You can compare a Token instance with a string. Instead of:
if token.text == "bla":
    do_something()
you can do:
if token == "bla":
    do_something()
You can call len() on a token, which returns the length of the token’s text attribute, and you can use the string format method to embed the token’s text in another string:
s = "blabla {}".format(token)
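How such string-like behaviour can be obtained is sketched below with a hypothetical stand-in class. The special methods chosen here are an assumption for illustration and may differ from how the module implements it.

```python
class Token:
    """Stand-in token that behaves like its text in a few places."""

    def __init__(self, pos, text):
        self.pos, self.text = pos, text

    def __eq__(self, other):
        # Compare by text against strings, by identity otherwise.
        if isinstance(other, str):
            return self.text == other
        return self is other

    # Defining __eq__ removes the default hash; restore identity hashing.
    __hash__ = object.__hash__

    def __len__(self):
        return len(self.text)

    def __format__(self, spec):
        return format(self.text, spec)


token = Token(0, "bla")
is_bla = token == "bla"
length = len(token)
s = "blabla {}".format(token)
```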
A token always has a parent, and that parent is always a Context instance.
- is_token = True#
Always True for Token
- pos#
The position in the original text
- text#
The text of this token
- action#
The action specified by the lexicon rule that created the token
- property end#
The end position of this token in the original text.
- group = None#
Always None for Token, an integer for GroupToken
- equals(other)[source]#
Return True if the other Token has the same text and action attributes and the same context ancestry (see also state_matches()).
Note that the pos attribute is not compared.
- state_matches(other)[source]#
Return True if the other Token has the same lexicons in the ancestors.
- forward_until_including(other)[source]#
Yield all tokens starting with us, up to and including the other.
- common_ancestor_with_trail(other)[source]#
Return a three-tuple (context, trail_self, trail_other).
The context is the common ancestor as returned by common_ancestor, if any. trail_self is a tuple of indices leading from the common ancestor to self, and trail_other is a tuple of indices leading from the same ancestor to the other Token.
If there is no common ancestor, all three are None. Normally, however, all nodes share the root context, so that is the topmost common ancestor.
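One way to compute such a result is sketched below. The Node class with an explicit children list and both helper functions are hypothetical; the sketch only illustrates how the two trails relate to the common ancestor.

```python
class Node:
    """Stand-in node with a children list and a plain parent reference."""

    def __init__(self, name, *children):
        self.name, self.parent, self.children = name, None, list(children)
        for child in children:
            child.parent = self


def ancestors_with_index(node):
    """Yield (ancestor, index of the previous node in that ancestor)."""
    while node.parent is not None:
        parent = node.parent
        index = next(i for i, n in enumerate(parent.children) if n is node)
        yield parent, index
        node = parent


def common_ancestor_with_trail(a, b):
    climb_a = list(ancestors_with_index(a))
    depth_of = {id(ctx): depth for depth, (ctx, _) in enumerate(climb_a)}
    climb_b = []
    for ctx, index in ancestors_with_index(b):
        climb_b.append(index)
        if id(ctx) in depth_of:
            indices_a = [i for _, i in climb_a[:depth_of[id(ctx)] + 1]]
            # Indices were collected bottom-up; trails read top-down.
            return ctx, tuple(reversed(indices_a)), tuple(reversed(climb_b))
    return None, None, None


a, b, c, d = (Node(s) for s in "abcd")
inner = Node("inner", b, c)
root = Node("root", a, inner, d)

ancestor, trail_b, trail_d = common_ancestor_with_trail(b, d)
```

Following trail_b from the ancestor (root.children[1].children[0]) leads back to b, and following trail_d leads to d.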
- class GroupToken(group, parent, pos, text, action)[source]#
Bases: Token
A Token class that allows setting the group attribute.
For normal Token instances, group is a class attribute that is always None. For Tokens that belong to a group, i.e. originated from a single regular expression match, the group attribute is the index of the token in the group of tokens that were created together.
The last token in the group has a negative value, so it can be recognized as the last. For example, tokens of a three-group have the indices 0, 1 and -2.
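The numbering scheme can be worked through with a small sketch. The helper functions below are hypothetical and operate on a plain list of group values, one per sibling, rather than on real tokens.

```python
def group_indices(count):
    """Return the group values for count tokens created by one match."""
    if count < 2:
        raise ValueError("a group has at least two tokens")
    # All but the last get their index; the last index is negated.
    return list(range(count - 1)) + [-(count - 1)]


def group_start(groups, index):
    """Walk left from index to the first member of its group (value 0)."""
    while groups[index]:
        index -= 1
    return index


def group_end(groups, index):
    """Walk right from index to the last member (negative value)."""
    if groups[index] is None:
        return index
    while groups[index] >= 0:
        index += 1
    return index


three = group_indices(3)

# Sibling group values: a normal token, a three-group, a normal token.
siblings = [None, 0, 1, -2, None]
start = group_start(siblings, 3)
end = group_end(siblings, 1)
```

Because the first member is 0 and the last is negative, both ends of a group can be recognized without knowing the group's size in advance.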
The methods get_group(), get_group_start() and get_group_end() can only be used reliably when no tokens have been deleted from the tree, and when the tokens really have a parent.
- group#
The index of this token in a group (negated for the last token in a group)
- class Context(lexicon, parent)[source]#
A Context represents a list of tokens and contexts.
The lexicon that created the tokens is in the lexicon attribute.
If a pattern rule jumps to another lexicon, a sub-Context is created and tokens are added there. If that lexicon pops back to the current one, new tokens can appear after the sub-context. (So the token that caused the jump to the sub-context normally precedes the context it created.)
A context has a parent attribute, which can point to an enclosing context. The root context has parent None.
When iterating over the children of a Context (which may be Context or Token instances), you can use the is_context attribute to determine whether the child node is a context, which is easier than calling isinstance(node, Context) each time.
You can quickly find tokens in a context, based on text:
if "bla" in context:
    # etc
Or child contexts, based on lexicon:
if MyLanguage.lexicon in context:
    # etc
And if you want to know which token is at a certain position in the text, use e.g.:
context.find_token(45)
which uses a bisection algorithm to quickly return the token, which might be in any sub-context of the current context.
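The two membership tests described above could be supported by a single __contains__ method that dispatches on the type of the operand. The sketch below uses hypothetical stand-in classes and assumes that only direct children are searched; the module's own behaviour may differ.

```python
class Token:
    is_token = True

    def __init__(self, text):
        self.text = text


class Context(list):
    is_token = False

    def __init__(self, lexicon, children=()):
        super().__init__(children)
        self.lexicon = lexicon

    def __contains__(self, item):
        # A string matches a child token's text; anything else is
        # treated as a lexicon and matched against child contexts.
        if isinstance(item, str):
            return any(n.is_token and n.text == item for n in self)
        return any(not n.is_token and n.lexicon is item for n in self)


# Hypothetical lexicon objects, represented by plain sentinels.
ROOT_LEXICON, STRING_LEXICON = object(), object()

context = Context(ROOT_LEXICON, [Token("bla"), Context(STRING_LEXICON)])
has_text = "bla" in context
has_lexicon = STRING_LEXICON in context
```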
- is_context = True#
Always True for Context
- lexicon#
The lexicon this context was instantiated with.
- property ls#
List the contents of this Context, for debugging purposes.
- property pos#
Return the position of our first token. Returns 0 if empty.
- property end#
Return the end position of our last token. Returns 0 if empty.
- tokens(reverse=False)[source]#
Yield all Tokens, descending into nested Contexts.
If reverse is set to True, yield all tokens in backward direction.
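The recursion behind such a method can be sketched with nested lists standing in for contexts and strings standing in for tokens. This is an illustration only, not the module's code.

```python
def tokens(context, reverse=False):
    """Yield all leaves of a nested list, optionally in backward order."""
    for node in (reversed(context) if reverse else context):
        if isinstance(node, list):
            # A nested list plays the role of a child Context.
            yield from tokens(node, reverse)
        else:
            yield node


tree = ["a", ["b", ["c"], "d"], "e"]
forwards = list(tokens(tree))
backwards = list(tokens(tree, reverse=True))
```

Passing reverse down into the recursion is what makes the backward order the exact mirror of the forward order.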
- find(pos)[source]#
Return the index of our child at (or to the right of) pos.
Returns -1 if there is no such child.
- find_token(pos)[source]#
Return the Token at or to the right of position.
Returns None if there is no such token.
- find_token_with_trail(pos)[source]#
Return the Token at or to the right of position, and the trail of indices.
The trail is the list of indices where the token was found. Returns (None, None) if there is no such token. Here is an example:
>>> import parce
>>> tree = parce.root(parce.find('css'), open('parce/themes/default.css').read())
>>> tree.find_token_with_trail(600)
(<Token ' Selected te...ow has focus ' at 566:607 (Comment)>, [21, 0])
>>> tree[21][0]
<Token ' Selected te...ow has focus ' at 566:607 (Comment)>
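How a trail is built during a bisecting search can be sketched with nested lists as contexts and (pos, text) tuples as tokens. The functions below are hypothetical; in particular, treating a token as found when its end lies beyond pos is an assumption, and sub-contexts are assumed to be non-empty.

```python
def end_of(node):
    """End position of a token tuple, or of the last token in a context."""
    while isinstance(node, list):
        node = node[-1]
    pos, text = node
    return pos + len(text)


def find_token_with_trail(context, pos):
    """Return (token, trail) for the token at or to the right of pos."""
    trail, node = [], context
    while isinstance(node, list):
        lo, hi = 0, len(node)
        while lo < hi:  # bisect for the first child ending after pos
            mid = (lo + hi) // 2
            if end_of(node[mid]) <= pos:
                lo = mid + 1
            else:
                hi = mid
        if lo == len(node):
            return None, None
        trail.append(lo)
        node = node[lo]
    return node, trail


# Tokens for the text: a = "hi" b   (whitespace yields no tokens)
tree = [(0, "a"), [(4, '"'), (5, "hi"), (7, '"')], (9, "b")]

token, trail = find_token_with_trail(tree, 6)
```

Indexing the tree with the trail, here tree[1][1], returns the same token again, just as in the doctest above.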
- find_left(pos)[source]#
Return the index of our child at or to the left of pos.
Returns -1 if there is no such child.
- find_token_left(pos)[source]#
Return the Token at or to the left of position.
Returns None if there is no such token.
- find_token_left_with_trail(pos)[source]#
Return the Token at or to the left of position, and the trail of indices.
Returns (None, None) if there is no such token.
- find_token_after(pos)[source]#
Return the first token completely to the right of pos.
Returns None if there is no token to the right of pos.
- find_token_before(pos)[source]#
Return the last token completely to the left of pos.
Returns None if there is no token to the left of pos.
- range(start=0, end=None)[source]#
Return a Range.
The ancestor of the range is the common ancestor of the tokens found at start and end (or the context itself if start or end fall outside this context). If start is 0 and end is None, the range encompasses the full context.
Returns None if this context is empty.
- class Range(ancestor, start_trail=None, end_trail=None)[source]#
Bases: object
A Range denotes a range of a tree structure.
A range is defined by an ancestor context and possibly empty lists pointing to the start and end token, if specified. If neither trail is specified, the range encompasses the full context.
- ancestor#
The specified ancestor
- start_trail#
The specified start trail (empty list by default)
- end_trail#
The specified end trail (empty list by default)
- property pos#
The position of the first token in our range.
- property end#
The end position of the last token in our range.
- classmethod from_tree(tree, start=0, end=None)[source]#
Create a Range.
The ancestor is the common ancestor of the tokens found at start and end (or the tree itself if start or end fall outside the range of the tree). If start is 0 and end is None, the range encompasses the full tree.
Returns None if the tree is empty.
- slices(target_factory=None)[source]#
Yield (context, slice) tuples.
The yielded slices include the tokens at the ends of the start and end trails.
If you specify a target_factory, it should be a TargetFactory object, and it will be updated along with the yielded slices.
- make_tokens(lexemes, parent=None)[source]#
Factory returning a tuple of one or more Token instances for the lexemes.
The lexemes argument is an iterable of three-tuples like the lexemes in an Event namedtuple defined in the lexer module. If there is more than one lexeme, GroupToken instances are created.
The specified parent context is set as parent, if given.
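A factory with this behaviour could look like the sketch below. It is not the module's implementation: the stand-in classes, the use of __slots__, and the assumed (pos, text, action) order of each lexeme are illustrative assumptions.

```python
class Token:
    __slots__ = ("parent", "pos", "text", "action")
    group = None  # class attribute, so normal tokens store nothing for it

    def __init__(self, parent, pos, text, action):
        self.parent, self.pos = parent, pos
        self.text, self.action = text, action


class GroupToken(Token):
    __slots__ = ("group",)  # only grouped tokens pay for the attribute

    def __init__(self, group, parent, pos, text, action):
        super().__init__(parent, pos, text, action)
        self.group = group


def make_tokens(lexemes, parent=None):
    """Return a tuple of tokens; assumes (pos, text, action) lexemes."""
    lexemes = list(lexemes)
    if len(lexemes) == 1:
        return (Token(parent, *lexemes[0]),)
    last = len(lexemes) - 1
    # The last member of a group gets its index negated.
    return tuple(
        GroupToken(-i if i == last else i, parent, *lexeme)
        for i, lexeme in enumerate(lexemes)
    )


# Hypothetical action names, used only as labels.
single = make_tokens([(0, "a", "Name")])
grouped = make_tokens([(0, "a", "Name"), (1, "=", "Operator"), (2, "b", "Name")])
groups = [t.group for t in grouped]
```

Since code only ever tests the group attribute, callers do not need to know which of the two classes the factory picked.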