The tree module#

This module defines the tree structure a text is parsed into.

A tree consists of Context and Token objects. (Both inherit from the base class Node, which defines the shared methods and properties.)

A Context is a list containing Tokens and other Contexts. A Context is created when a lexicon becomes active. A Context knows its parent Context and its lexicon.

A Token represents one parsed piece of text. A Token is created when a rule in the lexicon matches. A Token knows its parent Context, its position in the text and the action that was specified in the rule.

A Context is always non-empty, except for the root Context, which represents the root lexicon and can be empty if the document did not generate a single token.

The tree structure is easy to navigate; no special objects or iterators are needed. To find a token at a certain position in a context, use Context.find_token() and its relatives. From every node you can iterate with forward() and backward(), and methods such as left_siblings() and right_siblings() traverse the current context.
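
For example (a minimal sketch; the file name and the position 45 are made up for illustration):

import parce

tree = parce.root(parce.find('css'), open('style.css').read())
token = tree.find_token(45)       # Token at (or to the right of) position 45
for t in token.forward():         # all following Tokens in document order
    ...
for n in token.right_siblings():  # following siblings (Tokens or Contexts)
    ...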

class Node[source]#

Bases: object

Methods that are shared by Token and Context.

is_token = False#
is_context = False#
property parent#

The parent Context (or None; uses a weak reference).

copy(parent=None)[source]#

Return a copy of the Node, but with the specified parent.

dump(file=None, style=None, depth=0)[source]#

Display a graphical representation of the node and its contents.

The file object defaults to stdout, and the style to “round”. You can choose any style that’s in the DUMP_STYLES dictionary.

property pwd#

Show the ancestry, for debugging purposes.

parent_index()[source]#

Return our index in the parent.

This is preferable to parent.index(self): this method finds our index using a binary search on position, while the latter performs a linear search, which is slower when the parent has many children.
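
For example, to reach a node's neighbor without a linear search (a minimal sketch, assuming node is an existing non-root Node):

i = node.parent_index()            # binary search on position
assert node.parent[i] is node
if i + 1 < len(node.parent):
    neighbor = node.parent[i + 1]  # the next sibling, Token or Context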

root()[source]#

Return the root node.

is_last()[source]#

Return True if this Node is the last child of its parent.

Fails if called on the root node.

is_first()[source]#

Return True if this Node is the first child of its parent.

Fails if called on the root node.

is_ancestor_of(node)[source]#

Return True if this Node is an ancestor of the other Node.

ancestors(upto=None)[source]#

Climb the tree up over the parents.

If upto is given and it is one of the ancestors, stop after yielding that ancestor. Otherwise iteration stops at the root node.

ancestors_with_index(upto=None)[source]#

Yield the ancestors(upto), and the index of each node in the parent.

common_ancestor(other)[source]#

Return the common ancestor of this node and the other Context or Token.

depth()[source]#

Return the number of ancestors.

left_sibling()[source]#

Return the left sibling of this node, if any.

Does not descend in child nodes or ascend up to the parent. Fails if called on the root node.

right_sibling()[source]#

Return the right sibling of this node, if any.

Does not descend in child nodes or ascend up to the parent. Fails if called on the root node.

left_siblings()[source]#

Yield the left siblings of this node in reverse order, if any.

Does not descend in child nodes or ascend up to the parent. Fails if called on the root node.

right_siblings()[source]#

Yield the right siblings of this node, if any.

Does not descend in child nodes or ascend up to the parent. Fails if called on the root node.

next_token()[source]#

Return the following Token, if any.

previous_token()[source]#

Return the preceding Token, if any.

forward(upto=None)[source]#

Yield all Tokens in forward direction, starting at the right sibling.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.

backward(upto=None)[source]#

Yield all Tokens in backward direction, starting at the left sibling.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.
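
For example, to visit following tokens without leaving a given context (a sketch, assuming token is an existing Token):

ctx = token.parent
for t in token.forward(upto=ctx):  # stays within ctx and its descendants
    ...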

property query#

Query this node in different ways; see the query module.

delete()[source]#

Remove this node from its parent.

If the parent would become empty, it is removed too. Returns the first non-empty ancestor.

class Token(parent, pos, text, action)[source]#

Bases: Node

A Token instance represents a lexed piece of text.

When a pattern rule in a lexicon matches the text, a Token is created. When such a rule creates more than one Token from a single regular expression match, GroupToken objects are created instead, carrying the index of the token within the group in the group attribute. For normal tokens, the group attribute is a read-only class attribute that is None.

GroupTokens are thus always adjacent in the same context. If you want to retokenize text starting at some position, be sure you are at the start of a grouped token, e.g.:

t = ctx.find_token(45)
if t.group:                  # in a group, but not its first token
    for t in t.left_siblings():
        if not t.group:      # the first token of a group has index 0
            break
pos = t.pos

Alternatively, you can use the GroupToken.get_group_* methods.
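
For example, the sibling walk above could be written as (a sketch, equivalent under the same assumptions):

t = ctx.find_token(45)
if t.group is not None:      # a GroupToken, so get_group_start() exists
    t = t.get_group_start()
pos = t.pos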

(A GroupToken is otherwise just a normal Token; the subclass exists because the group attribute is unused for the vast majority of tokens, and keeping it as a class attribute means it costs no per-instance memory. You never need to reference the GroupToken class; just test the group attribute if you want to know whether a token belongs to a group that originated from a single match.)

When iterating over the children of a Context (which may be Context or Token instances), you can use the is_token attribute to determine whether a child node is a token, which is easier than calling isinstance(node, Token) each time.

From a token, you can iterate forward() or backward() to find adjacent tokens. If you only want to stay in the current context, use the various sibling methods, such as right_sibling().

By traversing the ancestors() of a token or context, you can find which lexicons created the tokens.
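
For example, to list the lexicons that are active at a token (a minimal sketch, assuming token is an existing Token):

lexicons = [context.lexicon for context in token.ancestors()]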

You can compare a Token instance with a string. Instead of:

if token.text == "bla":
    do_something()

you can do:

if token == "bla":
    do_something()

You can call len() on a token, which returns the length of the token’s text attribute, and you can use the string format method to embed the token’s text in another string:

s = "blabla {}".format(token)

A token always has a parent, and that parent is always a Context instance.

is_token = True#

Always True for Token

pos#

The position in the original text

text#

The text of this token

action#

The action specified by the lexicon rule that created the token

property end#

The end position of this token in the original text.

group = None#

Always None for Token, an integer for GroupToken

copy(parent=None)[source]#

Return a copy of the Token, but with the specified parent.

equals(other)[source]#

Return True if the other Token has the same text and action attributes and the same context ancestry (see also state_matches()).

Note that the pos attribute is not compared.

state_matches(other)[source]#

Return True if the other Token has the same lexicons in the ancestors.

forward_including(upto=None)[source]#

Yield all tokens in forward direction, including self.

backward_including(upto=None)[source]#

Yield all tokens in backward direction, including self.

forward_until_including(other)[source]#

Yield all tokens starting with this token, up to and including the other.

common_ancestor_with_trail(other)[source]#

Return a three-tuple (context, trail_self, trail_other).

The context is the common ancestor as returned by common_ancestor(), if any. trail_self is a tuple of indices from the common ancestor down to self, and trail_other is a tuple of indices from the same ancestor down to the other Token.

If there is no common ancestor, all three are None. Normally, however, all nodes share the root context, so that is the topmost common ancestor.
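
For example, a trail can be used to walk from the common ancestor back down to a token (a sketch, assuming t1 and t2 are Tokens from the same tree):

context, trail_self, trail_other = t1.common_ancestor_with_trail(t2)
if context is not None:
    node = context
    for i in trail_self:      # follow the indices down the tree
        node = node[i]
    assert node is t1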

range(other)[source]#

Return a Range from this token up to and including the other.

Returns None if the other Token does not belong to the same tree.

class GroupToken(group, parent, pos, text, action)[source]#

Bases: Token

A Token class that allows setting the group attribute.

For normal Token instances, group is a class attribute that is always None. For Tokens that belong to a group, i.e. originated from a single regular expression match, the group attribute is the index of the token in the group of tokens that were created together.

The last token in the group has a negative value, so it can be recognized as the last. For example, tokens of a three-group have the indices 0, 1 and -2.
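
For example, to test a token's place in its group (a minimal sketch, assuming token is an existing Token):

if token.group is not None:
    if token.group == 0:      # first token of the group
        ...
    elif token.group < 0:     # last token of the group
        ...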

The methods get_group(), get_group_start() and get_group_end() can only be used reliably when no tokens have been deleted from the tree and the tokens actually have a parent.

group#

The index of this token in a group (negated for the last token in a group)

copy(parent=None)[source]#

Return a copy of the Token, but with the specified parent.

classmethod make_group(parent, lexemes)[source]#

Create a tuple of GroupTokens for the lexemes.

get_group()[source]#

Return the whole group this token belongs to as a list.

get_group_start()[source]#

Return the first token of the group this token belongs to.

get_group_end()[source]#

Return the last token of the group this token belongs to.

class Context(lexicon, parent)[source]#

Bases: list, Node

A Context represents a list of tokens and contexts.

The lexicon that created the tokens is in the lexicon attribute.

If a pattern rule jumps to another lexicon, a sub-Context is created and tokens are added there. If that lexicon pops back to the current one, new tokens can appear after the sub-context. (So the token that caused the jump to the sub-context normally precedes the context it created.)

A context has a parent attribute, which can point to an enclosing context. The root context has parent None.

When iterating over the children of a Context (which may be Context or Token instances), you can use the is_context attribute to determine whether a child node is a context, which is easier than calling isinstance(node, Context) each time.
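
For example (a minimal sketch, assuming context is an existing Context):

for node in context:
    if node.is_token:
        print(node.pos, node.action)
    else:                     # node.is_context
        print(node.lexicon)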

You can quickly find tokens in a context, based on text:

if "bla" in context:
    # etc

Or child contexts, based on lexicon:

if MyLanguage.lexicon in context:
    # etc

And if you want to know which token is on a certain position in the text, use e.g.:

context.find_token(45)

This uses a bisection algorithm to quickly find the token, which may reside in any sub-context of the current context.

is_context = True#

Always True for Context

lexicon#

The lexicon this context was instantiated with.

property ls#

List the contents of this Context, for debugging purposes.

copy(parent=None)[source]#

Return a copy of the context, but with the specified parent.

property pos#

Return the position of our first token. Returns 0 if empty.

property end#

Return the end position of our last token. Returns 0 if empty.

is_root()[source]#

Return True if this Context has no parent node.

height()[source]#

Return the height of the tree (the longest distance to a descendant).

tokens(reverse=False)[source]#

Yield all Tokens, descending into nested Contexts.

If reverse is set to True, yield all tokens in backward direction.

first_token()[source]#

Return our first Token.

last_token()[source]#

Return our last Token.

find(pos)[source]#

Return the index of our child at (or to the right of) pos.

Returns -1 if there is no such child.

find_context(pos)[source]#

Return the youngest (most deeply nested) Context at position (or self).

find_token(pos)[source]#

Return the Token at or to the right of position.

Returns None if there is no such token.

find_token_with_trail(pos)[source]#

Return the Token at or to the right of position, and the trail of indices.

The trail is the list of indices where the token was found. Returns (None, None) if there is no such token. Here is an example:

>>> import parce
>>> tree = parce.root(parce.find('css'), open('parce/themes/default.css').read())
>>> tree.find_token_with_trail(600)
(<Token ' Selected te...ow has focus ' at 566:607 (Comment)>, [21, 0])
>>> tree[21][0]
<Token ' Selected te...ow has focus ' at 566:607 (Comment)>

find_left(pos)[source]#

Return the index of our child at or to the left of pos.

Returns -1 if there is no such child.

find_token_left(pos)[source]#

Return the Token at or to the left of position.

Returns None if there is no such token.

find_token_left_with_trail(pos)[source]#

Return the Token at or to the left of position, and the trail of indices.

Returns (None, None) if there is no such token.

find_token_after(pos)[source]#

Return the first token completely right from pos.

Returns None if there is no token right from pos.

find_token_before(pos)[source]#

Return the last token completely left from pos.

Returns None if there is no token left from pos.

range(start=0, end=None)[source]#

Return a Range.

The ancestor of the range is the common ancestor of the tokens found at start and end (or the context itself if start or end fall outside this context). If start is 0 and end is None, the range encompasses the full context.

Returns None if this context is empty.
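
For example (a sketch; the positions 10 and 50 are made up for illustration):

r = context.range(10, 50)
if r is not None:
    for t in r.tokens():      # the first and last token may extend
        ...                   # beyond the 10..50 positions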

class Range(ancestor, start_trail=None, end_trail=None)[source]#

Bases: object

A Range denotes a range of a tree structure.

A range is defined by an ancestor context and two trails: possibly empty lists of indices pointing to the start and end token, if specified. If neither trail is specified, the range encompasses the full context.

ancestor#

The specified ancestor

start_trail#

The specified start trail (empty list by default)

end_trail#

The specified end trail (empty list by default)

property pos#

The position of the first token in our range.

property end#

The end position of the last token in our range.

classmethod from_tree(tree, start=0, end=None)[source]#

Create a Range.

The ancestor is the common ancestor of the tokens found at start and end (or the tree itself if start or end fall outside the range of the tree). If start is 0 and end is None, the range encompasses the full tree.

Returns None if the tree is empty.

slices(target_factory=None)[source]#

Yield (context, slice) tuples.

The yielded slices include the tokens at the ends of the start and end trails.

If you specify a target_factory, it should be a TargetFactory object, and it will be updated along with the yielded slices.
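
For example, to process a range per context (a minimal sketch, assuming r is an existing Range):

for context, s in r.slices():
    for node in context[s]:  # the direct children within the range
        ...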

tokens()[source]#

Yield all tokens in this range.

The first and last tokens may overlap with the start and end positions.

make_tokens(lexemes, parent=None)[source]#

Factory returning a tuple of one or more Token instances for the lexemes.

The lexemes argument is an iterable of three-tuples like the lexemes in an Event namedtuple defined in the lexer module. If there is more than one lexeme, GroupToken instances are created.

The specified parent context is set as parent, if given.
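
For example (a minimal sketch; any objects can serve as actions, so plain strings stand in for real actions here):

from parce.tree import make_tokens

tokens = make_tokens([(0, 'foo', 'Name')])                  # one lexeme: a single Token
group = make_tokens([(0, '<', 'Delim'), (1, 'a', 'Name')])  # two lexemes: GroupTokens
                                                            # with group indices 0 and -1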