The tree module

This module defines the tree structure a text is parsed into.

A tree consists of Context and Token objects.

A Context is a list containing Tokens and other Contexts. A Context is created when a lexicon becomes active. A Context knows its parent Context and its lexicon.

A Token represents one parsed piece of text. A Token is created when a rule in the lexicon matches. A Token knows its parent Context, its position in the text and the action that was specified in the rule.

There is always exactly one root Context, representing the root lexicon. A Context is always non-empty, except for the root Context, which is empty if the document did not generate a single token.

The tree structure is easy to navigate; no special objects or iterators are needed.

To find a token at a certain position in a context, use find_token() and its relatives. From every token you can iterate forward() and backward(). Use the methods like left_siblings() and right_siblings() to traverse the current context.

See also the documentation for Token and Context.

class Node[source]

Bases: object

Methods that are shared by Token and Context.

is_token = False
is_context = False
property parent

The parent Context (or None; uses a weak reference).

dump(file=None, style=None, depth=0)[source]

Display a graphical representation of the node and its contents.

The file object defaults to stdout, and the style to “round”. You can choose any style that’s in the DUMP_STYLES dictionary.

parent_index()[source]

Return our index in the parent.

This is recommended over parent.index(self), because this method finds our index using a binary search on position, while the latter performs a linear search, which is slower for a large number of children.
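The difference can be illustrated with a minimal, self-contained sketch (the FakeToken class and parent_index function here are hypothetical stand-ins, not parce's actual code): a binary search on the pos attribute locates the index in O(log n), where list.index() would scan every child.

```python
import bisect

class FakeToken:
    """Minimal stand-in for a Token; only pos matters here."""
    def __init__(self, pos):
        self.pos = pos

def parent_index(children, node):
    """Find node's index in children via binary search on pos."""
    positions = [c.pos for c in children]
    i = bisect.bisect_left(positions, node.pos)
    # in a real tree a context may start at the same pos as a token,
    # so scan forward until we hit the node itself
    while children[i] is not node:
        i += 1
    return i

children = [FakeToken(p) for p in (0, 5, 12, 20, 33)]
assert parent_index(children, children[3]) == 3
```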

root()[source]

Return the root node.

is_root()[source]

Return True if this Node has no parent node.

is_last()[source]

Return True if this Node is the last child of its parent.

Fails if called on the root element.

is_first()[source]

Return True if this Node is the first child of its parent.

Fails if called on the root element.

is_ancestor_of(node)[source]

Return True if this Node is an ancestor of the other Node.

ancestors(upto=None)[source]

Climb the tree up over the parents.

If upto is given and it is one of the ancestors, stop after yielding that ancestor. Otherwise iteration stops at the root node.
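The climbing logic can be sketched with a minimal stand-in class (hypothetical code, not parce's implementation, which also uses weak parent references):

```python
class FakeNode:
    """Minimal stand-in for a Node; only the parent link matters."""
    def __init__(self, parent=None):
        self.parent = parent

    def ancestors(self, upto=None):
        """Climb over the parents; stop after yielding upto, if given."""
        node = self.parent
        while node is not None:
            yield node
            if node is upto:
                break
            node = node.parent

root = FakeNode()
mid = FakeNode(root)
leaf = FakeNode(mid)
assert list(leaf.ancestors()) == [mid, root]
assert list(leaf.ancestors(upto=mid)) == [mid]
```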

ancestors_with_index(upto=None)[source]

Yield the same ancestors as ancestors(upto), each paired with its index in its parent.

common_ancestor(other)[source]

Return the common ancestor of this node and the other Context or Token.

depth()[source]

Return the number of ancestors.

left_sibling()[source]

Return the left sibling of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

right_sibling()[source]

Return the right sibling of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

left_siblings()[source]

Yield the left siblings of this node in reverse order, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

right_siblings()[source]

Yield the right siblings of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

next_token()[source]

Return the following Token, if any.

previous_token()[source]

Return the preceding Token, if any.

forward(upto=None)[source]

Yield all Tokens in forward direction.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.

backward(upto=None)[source]

Yield all Tokens in backward direction.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.
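The traversal pattern behind forward() (yield the right siblings, descending into sub-contexts, then climb to the parent and continue) can be sketched with plain lists standing in for Contexts. This is hypothetical illustration code, not parce's implementation; in particular, parce finds the sibling index by position bisection rather than list.index():

```python
class FakeToken:
    def __init__(self, name):
        self.name = name
        self.parent = None

class FakeContext(list):
    def __init__(self, *children):
        super().__init__(children)
        self.parent = None
        for child in children:
            child.parent = self

def all_tokens(node):
    """Yield every token at or below node, depth-first."""
    if isinstance(node, FakeContext):
        for child in node:
            yield from all_tokens(child)
    else:
        yield node

def forward(node, upto=None):
    """Yield all tokens after node: right siblings first, then climb."""
    while node.parent is not None:
        parent = node.parent
        i = parent.index(node)
        for sibling in parent[i + 1:]:
            yield from all_tokens(sibling)
        if parent is upto:
            return
        node = parent

a, b, c, d = (FakeToken(n) for n in "abcd")
tree = FakeContext(a, FakeContext(b, c), d)
assert [t.name for t in forward(a)] == ["b", "c", "d"]
assert [t.name for t in forward(b)] == ["c", "d"]
```

backward() is the mirror image: walk the left siblings in reverse, descending into sub-contexts from the right.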

property query

Query this node in different ways; see the query module.

delete()[source]

Remove this node from its parent.

If the parent would become empty, it is removed too. Returns the first non-empty ancestor.

class Token(parent, pos, text, action)[source]

Bases: parce.tree.Node

A Token instance represents a lexed piece of text.

A token has the following attributes:

parent:

the Context node to which the token was added

pos:

the position of the token in the original text

end:

the end position of the token in the original text

text:

the text of the token

action:

the action specified by the lexicon rule that created the token

When a pattern rule in a lexicon matches the text, a Token is created. When that rule would create more than one Token from a single regular expression match, _GroupToken objects are created instead, carrying the index of the token in the group in the group attribute. For normal tokens, the group attribute is a read-only None.

Grouped tokens are thus always adjacent in the same context. If you want to retokenize text starting at some position, be sure you start at the first token of a group, e.g.:

t = ctx.find_token(45)
if t.group:
    for t in t.left_siblings():
        if not t.group:
            break
pos = t.pos

(A _GroupToken is otherwise just a normal Token; the subclass exists because the group attribute is unused in the vast majority of tokens, so keeping it a class attribute saves per-instance memory. You never need to reference the _GroupToken class; just test the group attribute if you want to know whether a token belongs to a group that originated from a single match.)

When iterating over the children of a Context (which may be Context or Token instances), you can use the is_token attribute to determine whether a child is a token, which is easier than calling isinstance(node, Token) each time.

From a token, you can iterate forward() or backward() to find adjacent tokens. If you only want to stay in the current context, use the various sibling methods, such as right_sibling().

By traversing the ancestors() of a token or context, you can find which lexicons created the tokens.

You can compare a Token instance with a string. Instead of:

if token.text == "bla":
    do_something()

you can do:

if token == "bla":
    do_something()

You can call len() on a token, which returns the length of the token’s text attribute, and you can use the string format method to embed the token’s text in another string:

s = "blabla {}".format(token)

A token always has a parent, and that parent is always a Context instance.

is_token = True
group = None
pos
text
action
copy(parent=None)[source]

Return a copy of the Token, but with the specified parent.

equals(other)[source]

Return True if the other Token has the same text and action attributes and the same context ancestry (see also state_matches()).

Note that the pos attribute is not compared.

state_matches(other)[source]

Return True if the other Token has the same lexicons in the ancestors.

property end
forward_including(upto=None)[source]

Yield all tokens in forward direction, including self.

backward_including(upto=None)[source]

Yield all tokens in backward direction, including self.

forward_until_including(other)[source]

Yield all tokens, starting with this token, up to and including the other.

common_ancestor_with_trail(other)[source]

Return a three-tuple (context, trail_self, trail_other).

The context is the common ancestor, as returned by common_ancestor(), if any. trail_self is a tuple of indices from the common ancestor down to self, and trail_other is a tuple of indices from the same ancestor down to the other Token.

If there is no common ancestor, all three are None. Normally, however, all nodes share the root context, so that will be the topmost common ancestor.

target()[source]

Return the first context directly to the right of this Token.

The context is the right sibling of the token, or of one of its ancestors. If the token is part of a group, the context is found immediately after the last member of the group. The found context may also be a child of one of this token's grandparents, in case the target popped contexts first.

In all cases, the returned context is the one started by the target in the lexicon rule that created this token.

class Context(lexicon, parent)[source]

Bases: list, parce.tree.Node

A Context represents a list of tokens and contexts.

The lexicon that created the tokens is in the lexicon attribute.

If a pattern rule jumps to another lexicon, a sub-Context is created and tokens are added there. If that lexicon pops back to the current one, new tokens can appear after the sub-context. (So the token that caused the jump to the sub-context normally precedes the context it created.)

A context has a parent attribute, which can point to an enclosing context. The root context has parent None.

When iterating over the children of a Context (which may be Context or Token instances), you can use the is_context attribute to determine whether a child is a context, which is easier than calling isinstance(node, Context) each time.

You can quickly find tokens in a context, based on text:

if "bla" in context:
    # etc

Or child contexts, based on lexicon:

if MyLanguage.lexicon in context:
    # etc

And if you want to know which token is on a certain position in the text, use e.g.:

context.find_token(45)

which, using a bisection algorithm, quickly returns the token, which might be in any sub-context of the current context.
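The bisection idea can be sketched with plain lists standing in for sub-contexts (a hypothetical helper, not parce's actual find_token()): every child covers a contiguous range of positions, so a binary search on the children's end positions picks the right child, recursing into sub-lists.

```python
import bisect

class FakeToken:
    def __init__(self, pos, text):
        self.pos, self.text = pos, text
        self.end = pos + len(text)

def end_of(child):
    """End position of a token, or of a (nested) list's last token."""
    return end_of(child[-1]) if isinstance(child, list) else child.end

def find_token(context, pos):
    """Return the token at or to the right of pos, by bisection."""
    ends = [end_of(child) for child in context]
    i = bisect.bisect_right(ends, pos)
    if i == len(context):
        i -= 1          # past the last token: return the last one
    child = context[i]
    return find_token(child, pos) if isinstance(child, list) else child

tree = [FakeToken(0, "abc"), [FakeToken(3, "de"), FakeToken(5, "fg")],
        FakeToken(7, "h")]
assert find_token(tree, 4).text == "de"
assert find_token(tree, 0).text == "abc"
assert find_token(tree, 6).text == "fg"
```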

is_context = True
lexicon
copy(parent=None)[source]

Return a copy of the context, but with the specified parent.

property pos

Return the position of our first token. Returns 0 if empty.

property end

Return the end position of our last token. Returns 0 if empty.

height()[source]

Return the height of the tree (the longest distance to a descendant).

tokens()[source]

Yield all Tokens, descending into nested Contexts.

tokens_bw()[source]

Yield all Tokens, descending into nested Contexts, in backward direction.
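Depth-first token iteration reduces to a short recursive generator when Contexts are modeled as plain lists (a sketch, not parce's implementation):

```python
def tokens(context):
    """Yield all tokens in context, descending into nested lists."""
    for child in context:
        if isinstance(child, list):
            yield from tokens(child)
        else:
            yield child

def tokens_bw(context):
    """The same, but in backward direction."""
    for child in reversed(context):
        if isinstance(child, list):
            yield from tokens_bw(child)
        else:
            yield child

tree = ["a", ["b", ["c"]], "d"]
assert list(tokens(tree)) == ["a", "b", "c", "d"]
assert list(tokens_bw(tree)) == ["d", "c", "b", "a"]
```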

first_token()[source]

Return our first Token.

last_token()[source]

Return our last token.

find(pos)[source]

Return the index of our child at pos.

find_token(pos)[source]

Return the Token at or to the right of position.

find_token_with_trail(pos)[source]

Return the Token at or to the right of position, and the trail of indices.

find_left(pos)[source]

Return the index of our child at or to the left of pos.

find_token_left(pos)[source]

Return the Token at or to the left of position.

find_token_left_with_trail(pos)[source]

Return the Token at or to the left of position, and the trail of indices.

find_token_after(pos)[source]

Return the first token completely to the right of pos.

Returns None if there is no token to the right of pos.

find_token_before(pos)[source]

Return the last token completely to the left of pos.

Returns None if there is no token to the left of pos.

tokens_range(start=0, end=None)[source]

Yield all tokens, or, if a range is specified, the tokens that together cover that text range.

The first and last tokens may overlap with the start and end positions.

context_slices(start=0, end=None)[source]

Yield (context, slice) tuples to yield tokens from.

Tokens can then be obtained with the context[slice] notation. The first and last tokens yielded this way may overlap with the start and end positions.

context_trails(start=0, end=None)[source]

Return a three-tuple (context, start_trail, end_trail).

This can be used to denote a range of the tree structure in slices. The returned context is the common ancestor of the tokens found at start and end (or the current node if start or end fall outside the range of the node). The trails are (possibly empty) lists of indices pointing to the start and end token, if any.

slices(start_trail, end_trail, target_factory=None)[source]

Yield (context, slice) tuples from the current context.

start_trail and end_trail both are lists of indices that point to descendant tokens of this context. The yielded slices include these tokens.

If you specify a target_factory, it should be a TargetFactory object, and it will be updated along with the yielded slices.

source()[source]

Return the first Token, if any, when going to the left from this context.

The returned token is the one that created this context, i.e. the token for which this context is the target. If the token is a member of a group, the first member of the group is returned.

make_tokens(event, parent=None)[source]

Factory returning a tuple of one or more Token instances for the event.

The event is an Event namedtuple defined in the parce.lexer module. If the event contains more than one token, _GroupToken instances are created.

get_group(token)[source]

For a token that belongs to a group, return the whole group as a list.

get_group_start(token)[source]

For a token that belongs to a group, return the first token of the group.

get_group_end(token)[source]

For a token that belongs to a group, return the last token of the group.