Accessing the Tree Structure

When you have parsed text, the result is a tree structure of Tokens, contained by Contexts, which may be nested in other Contexts.

Let’s look at the token tree generated for the simple example in the Getting started section:

>>> tree.dump()
<Context Nonsense.root at 1-108 (19 children)>
 ├╴<Token 'Some' at 1:5 (Text)>
 ├╴<Token 'text' at 6:10 (Text)>
 ├╴<Token 'with' at 11:15 (Text)>
 ├╴<Token '3' at 16:17 (Literal.Number)>
 ├╴<Token 'numbers' at 18:25 (Text)>
 ├╴<Token 'and' at 26:29 (Text)>
 ├╴<Token '1' at 30:31 (Literal.Number)>
 ├╴<Token '"' at 32:33 (Literal.String)>
 ├╴<Context Nonsense.string at 33-67 (2 children)>
 │  ├╴<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
 │  ╰╴<Token '"' at 66:67 (Literal.String)>
 ├╴<Token ',' at 67:68 (Delimiter)>
 ├╴<Token 'and' at 69:72 (Text)>
 ├╴<Token '1' at 73:74 (Literal.Number)>
 ├╴<Token '%' at 75:76 (Comment)>
 ├╴<Context Nonsense.comment at 76-89 (1 child)>
 │  ╰╴<Token ' comment that' at 76:89 (Comment)>
 ├╴<Token 'ends' at 90:94 (Text)>
 ├╴<Token 'on' at 95:97 (Text)>
 ├╴<Token 'a' at 98:99 (Text)>
 ├╴<Token 'newline' at 100:107 (Text)>
 ╰╴<Token '.' at 107:108 (Delimiter)>

Token

We see that the Token instances represent the matched text. Every Token has the matched text in the text attribute, its position in the source text in the pos attribute, and the action it was given in the action attribute. Tokens also have an end attribute, a property that returns self.pos + len(self.text).

Although a Token is not a string, you can test for equality:

if token == "bla":
    pass  # do something

Also, you can check if some text is in some Context:

if 'and' in tree:
    pass  # do something if 'and' is in the root context
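
Both comparisons can be understood in terms of plain Python: if an object's __eq__ accepts strings, a list holding such objects supports the in operator for strings without further code. The sketch below uses hypothetical stand-in classes (SketchToken and SketchContext), not parce's own implementation, and it assumes that membership only looks at direct children.

```python
class SketchToken:
    """Hypothetical stand-in for a token; not parce's Token class."""

    def __init__(self, text, pos):
        self.text = text
        self.pos = pos

    def __eq__(self, other):
        # Comparing with a string compares the matched text.
        if isinstance(other, str):
            return self.text == other
        return self is other

    # Defining __eq__ disables the default hash; restore identity hashing.
    __hash__ = object.__hash__


class SketchContext(list):
    """Hypothetical stand-in for a context: a list of tokens and contexts."""


tree = SketchContext([
    SketchToken("Some", 1),
    SketchToken("and", 26),
    SketchContext([SketchToken("nested", 33)]),
])

print(tree[1] == "and")   # the token's text equals "and"
print("and" in tree)      # list membership relies on ==
print("nested" in tree)   # only direct children are examined
```

Because the nested context is itself a list, it never compares equal to a string, so text inside it is not found by the membership test on the outer context.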

Context

A Context is basically a Python list, and it has the lexicon that created it in the lexicon attribute. The root of the tree is called the root context; it carries the root lexicon. You can access its child contexts and tokens with item or slice notation:

>>> print(tree[2])
<Token 'with' at 11:15 (Text)>

Besides that, a Context has pos and end attributes, which refer to the pos value of the first Token in the context and the end value of the last Token in the context (or in a sub-context).
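
As a minimal sketch of how such derived positions can work, the hypothetical stand-in classes below (not parce's implementation) compute pos and end of a context from its first and last child. The sketch assumes the context is not empty; the token positions are taken from the dump above.

```python
class SketchToken:
    """Hypothetical token with a stored pos and a computed end."""

    def __init__(self, text, pos):
        self.text = text
        self.pos = pos

    @property
    def end(self):
        return self.pos + len(self.text)


class SketchContext(list):
    """Hypothetical context that derives pos and end from its children."""

    @property
    def pos(self):
        # Position of the first child; recurses if that child is a context.
        return self[0].pos

    @property
    def end(self):
        # End of the last child; recurses if that child is a context.
        return self[-1].end


comment = SketchContext([SketchToken(" comment that", 76)])
root = SketchContext([
    SketchToken("Some", 1),
    SketchToken("%", 75),
    comment,
])

print(comment.pos, comment.end)
print(root.pos, root.end)
```

Here the end of root is found in the nested comment context, which illustrates the "(or in a sub-context)" remark.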

Just as a Token can be compared with a string, a Context can be compared with a Lexicon object. So it is possible to write:

>>> tree[8] == Nonsense.string
True
>>> Nonsense.comment in tree
True

A Context is never empty: if the parser switches to a new lexicon but that lexicon does not generate any Token, the empty Context is discarded. The only exception is the root context, which can be empty.

Traversing the tree structure

Both Token and Context have a parent attribute that points to their parent Context. Only for the root context, parent is None.

Token and Context both inherit from Node, which defines a lot of useful methods to traverse the tree structure.

Members shared by Token and Context

These are the attributes Token and Context both provide:

parent

The parent Context; the root context has parent None.

pos, end

The start and end position, respectively, of this node in the source text.

is_token

False for Context, True for Token

is_context

True for Context, False for Token

These are the methods Token and Context both provide:

class Node

Methods that are shared by Token and Context.

property parent

The parent Context (or None; uses a weak reference).

dump(file=None, style=None, depth=0)

Display a graphical representation of the node and its contents.

The file object defaults to stdout, and the style to “round”. You can choose any style that’s in the DUMP_STYLES dictionary.

parent_index()

Return our index in the parent.

This is preferred over parent.index(self), because this method finds our index using a binary search on position, while the latter is a linear search, which is slower when the parent has a large number of children.
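
The difference between the two lookups can be sketched as follows. The function sketch_parent_index is hypothetical and only illustrates the idea of a binary search on position; it assumes the children are sorted by pos and that no two children share the same pos.

```python
from types import SimpleNamespace


def sketch_parent_index(children, node):
    """Find node in children with a binary search on pos.

    Assumes children are sorted by pos and positions are unique.
    """
    lo, hi = 0, len(children)
    while lo < hi:
        mid = (lo + hi) // 2
        if children[mid].pos < node.pos:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(children) and children[lo] is node:
        return lo
    raise ValueError("node is not a child of this parent")


# Stand-in children that only carry a pos attribute.
children = [SimpleNamespace(pos=p) for p in (1, 6, 11, 16, 18, 26)]
print(sketch_parent_index(children, children[4]))
```

The search needs a number of comparisons proportional to the logarithm of the number of children, where a linear scan may have to look at every child.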

root()

Return the root node.

is_root()

Return True if this Node has no parent node.

is_last()

Return True if this Node is the last child of its parent.

Fails if called on the root element.

is_first()

Return True if this Node is the first child of its parent.

Fails if called on the root element.

is_ancestor_of(node)

Return True if this Node is an ancestor of the other Node.

ancestors(upto=None)

Climb the tree up over the parents.

If upto is given and it is one of the ancestors, stop after yielding that ancestor. Otherwise iteration stops at the root node.
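
The stopping rule for upto can be sketched with a small generator. SketchNode and sketch_ancestors are hypothetical names used for illustration; the sketch stores the parent as a plain attribute rather than a weak reference.

```python
class SketchNode:
    """Hypothetical node with a plain parent attribute."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def sketch_ancestors(node, upto=None):
    """Yield the parents of node, innermost first."""
    parent = node.parent
    while parent is not None:
        yield parent
        if parent is upto:
            break  # stop after yielding upto
        parent = parent.parent


root = SketchNode("root")
outer = SketchNode("outer", root)
inner = SketchNode("inner", outer)
token = SketchNode("token", inner)

print([n.name for n in sketch_ancestors(token)])
print([n.name for n in sketch_ancestors(token, upto=outer)])
```

If upto is not among the ancestors, the loop simply runs until it reaches the root.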

ancestors_with_index(upto=None)

Yield the ancestors(upto), and the index of each node in the parent.

common_ancestor(other)

Return the common ancestor of this node and the other Context or Token.

depth()

Return the number of ancestors.

left_sibling()

Return the left sibling of this node, if any.

Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.

right_sibling()

Return the right sibling of this node, if any.

Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.

left_siblings()

Yield the left siblings of this node in reverse order, if any.

Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.

right_siblings()

Yield the right siblings of this node, if any.

Does not descend into child nodes or ascend up to the parent. Fails if called on the root node.

next_token()

Return the following Token, if any.

previous_token()

Return the preceding Token, if any.

forward(upto=None)

Yield all Tokens in forward direction.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.

backward(upto=None)

Yield all Tokens in backward direction.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.
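
The combination of descending and ascending can be sketched as below. All names here (SketchToken, SketchContext, tokens, sketch_forward) are hypothetical stand-ins and not parce's implementation; the sketch covers the forward direction only.

```python
class SketchToken:
    is_token = True

    def __init__(self, text):
        self.text = text
        self.parent = None


class SketchContext(list):
    is_token = False

    def __init__(self, *children):
        super().__init__(children)
        self.parent = None
        for child in children:
            child.parent = self


def tokens(node):
    """Yield node itself if it is a token, else all tokens below it."""
    if node.is_token:
        yield node
    else:
        for child in node:
            yield from tokens(child)


def sketch_forward(node, upto=None):
    """Yield all tokens to the right of node, in document order."""
    while node.parent is not None and node is not upto:
        parent = node.parent
        index = next(i for i, n in enumerate(parent) if n is node)
        for sibling in parent[index + 1:]:
            yield from tokens(sibling)  # descend into child contexts
        node = parent                   # ascend into the parent context


a, b, c, d, e = (SketchToken(t) for t in "abcde")
inner = SketchContext(b, c)
root = SketchContext(a, inner, d, e)

print([t.text for t in sketch_forward(b)])
print([t.text for t in sketch_forward(b, upto=inner)])
```

With upto=inner the walk stops once it has reached the inner context, so tokens that follow that context are not yielded.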

property query

Query this node in different ways; see the query module.

delete()

Remove this node from its parent.

If the parent would become empty, it is removed too. Returns the first non-empty ancestor.

Members of Token

Token has the following additional methods and attributes for node traversal:

class Token
action

The action the Token was instantiated with.

group

The group the token belongs to. Normally None, but in some cases this attribute is a tuple of Tokens that form a group together. See below.

equals(other)

Return True if the other Token has the same text and action attributes and the same context ancestry (see also state_matches()).

Note that the pos attribute is not compared.

state_matches(other)

Return True if the other Token has the same lexicons in the ancestors.

forward_including(upto=None)

Yield all tokens in forward direction, including self.

forward_until_including(other)

Yield all tokens starting with us, up to and including the other.

backward_including(upto=None)

Yield all tokens in backward direction, including self.

common_ancestor_with_trail(other)

Return a three-tuple (context, trail_self, trail_other).

The context is the common ancestor as returned by common_ancestor, if any. trail_self is a tuple of indices from the common ancestor up to self, and trail_other is a tuple of indices from the same ancestor up to the other Token.

If there is no common ancestor, all three are None. But normally, all nodes share the root context, so that will normally be the topmost common ancestor.
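
To make the notion of a trail concrete, here is a minimal sketch under stated assumptions: SketchNode, path_from_root and sketch_common_ancestor_with_trail are hypothetical names, children are kept in a plain list, and both arguments are expected to be leaf nodes that have a parent.

```python
class SketchNode:
    """Hypothetical tree node with a parent and a list of children."""

    def __init__(self, name, *children):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in children:
            child.parent = self


def path_from_root(node):
    """Return (root, indices), where indices lead from root down to node."""
    indices = []
    while node.parent is not None:
        siblings = node.parent.children
        indices.append(next(i for i, n in enumerate(siblings) if n is node))
        node = node.parent
    indices.reverse()
    return node, indices


def sketch_common_ancestor_with_trail(a, b):
    root_a, trail_a = path_from_root(a)
    root_b, trail_b = path_from_root(b)
    if root_a is not root_b:
        return None, None, None
    ancestor, depth = root_a, 0
    # Follow the shared prefix, but never descend into a or b themselves.
    limit = min(len(trail_a), len(trail_b)) - 1
    while depth < limit and trail_a[depth] == trail_b[depth]:
        ancestor = ancestor.children[trail_a[depth]]
        depth += 1
    return ancestor, tuple(trail_a[depth:]), tuple(trail_b[depth:])


t0, t1, t2, t3 = (SketchNode(n) for n in ("t0", "t1", "t2", "t3"))
ctx = SketchNode("ctx", t1, t2)
root = SketchNode("root", t0, ctx, t3)

ancestor, trail_self, trail_other = sketch_common_ancestor_with_trail(t1, t3)
print(ancestor.name, trail_self, trail_other)
```

In this example t1 sits inside ctx, which is the second child of root, so its trail from root has two indices, while t3 is a direct child of root and has a trail of one index.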

target()

Return the first context directly to the right of this Token.

The context should be the right sibling of the token, or of any of its ancestors. If the token is part of a group, the context is found immediately next to the last member of the group. The found context may also be a child of the grand-parents of this token, in case the target popped contexts first.

In all cases, the returned context is the one started by the target in the lexicon rule that created this token.

Members of Context

Context builds on the Python list builtin, so it has all the methods list provides. It also has the following additional methods and attributes for node traversal:

class Context
lexicon

The lexicon that created this Context.

first_token()

Return our first Token.

last_token()

Return our last Token.

find_token(pos)

Return the Token at or to the right of position.

find_token_left(pos)

Return the Token at or to the left of position.

find_token_after(pos)

Return the first token completely to the right of pos.

Returns None if there is no token to the right of pos.

find_token_before(pos)

Return the last token completely to the left of pos.

Returns None if there is no token to the left of pos.
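
Position lookups of this kind lend themselves to a binary search. The sketch below is not parce's implementation: it works on a flat, sorted list of hypothetical SketchToken tuples instead of a nested tree, and it assumes that a token touching pos still counts as completely to the left or right of it.

```python
import bisect
from collections import namedtuple

# Hypothetical flat token record; positions are taken from the dump above.
SketchToken = namedtuple("SketchToken", "text pos end")

tokens = [
    SketchToken("Some", 1, 5),
    SketchToken("text", 6, 10),
    SketchToken("with", 11, 15),
]
starts = [t.pos for t in tokens]
ends = [t.end for t in tokens]


def sketch_find_token_after(pos):
    """Return the first token that starts at or after pos, else None."""
    i = bisect.bisect_left(starts, pos)
    return tokens[i] if i < len(tokens) else None


def sketch_find_token_before(pos):
    """Return the last token that ends at or before pos, else None."""
    i = bisect.bisect_right(ends, pos)
    return tokens[i - 1] if i > 0 else None


print(sketch_find_token_after(7).text)
print(sketch_find_token_before(7).text)
```

Position 7 lies inside the token 'text', so that token is neither completely to the left nor completely to the right of it, and the two functions return its neighbours.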

source()

Return the first Token, if any, when going to the left from this context.

The returned token is the one that created us, i.e. the token for which this context is the target. If the token is a member of a group, the first group member is returned.

tokens()

Yield all Tokens, descending into nested Contexts.

tokens_bw()

Yield all Tokens, descending into nested Contexts, in backward direction.

Often, when dealing with the tree structure, you want to know whether a node is a Token or a Context. Instead of calling:

if isinstance(node, parce.tree.Token):
    do_something()

two read-only attributes are available, is_token and is_context. The first is true only in Token instances, the second only in Context instances:

if node.is_token:
    do_something()

Grouped Tokens

When a dynamic action is used in a rule, and it generates more than one Token from the same regular expression match, these Tokens form a group, each having their index in the group in the group attribute. That attribute is read-only and None for normal Tokens. Grouped tokens are always adjacent and in the same Context.

Normally you don’t have to do much with this information, but parce needs it: when you edit a text, parce can’t start reparsing at a token that is not the first of its group, because the whole group was created from one regular expression match.

But just in case, if you want to be sure you have the first member of a Token group:

if token.group:
    # group is neither None nor 0, so this is not the first member
    for token in token.left_siblings():
        if not token.group:
            break

Querying the tree structure

Besides the various find methods, there is another powerful way to search for Tokens and Contexts in the tree, the query property of every Token or Context.

The query property of both Token and Context returns a Query object which is a generator initially yielding just that Token or Context:

>>> for node in tree.query:
...     print(node)
...
<Context Nonsense.root at 1-108 (19 children)>

But the Query object has powerful methods that modify the stream of nodes yielded by the generator. All these methods return a new Query object, so queries can be chained in an XPath-like fashion. For example:

>>> for node in tree.query[:3]:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>

The [:3] operator picks the first three children of every node yielded by the previous generator. You can use [:] or .children to get all children of every node:

>>> for node in tree.query.children:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>
<Token '3' at 16:17 (Literal.Number)>
<Token 'numbers' at 18:25 (Text)>
<Token 'and' at 26:29 (Text)>
<Token '1' at 30:31 (Literal.Number)>
<Token '"' at 32:33 (Literal.String)>
<Context Nonsense.string at 33-67 (2 children)>
<Token ',' at 67:68 (Delimiter)>
<Token 'and' at 69:72 (Text)>
<Token '1' at 73:74 (Literal.Number)>
<Token '%' at 75:76 (Comment)>
<Context Nonsense.comment at 76-89 (1 child)>
<Token 'ends' at 90:94 (Text)>
<Token 'on' at 95:97 (Text)>
<Token 'a' at 98:99 (Text)>
<Token 'newline' at 100:107 (Text)>
<Token '.' at 107:108 (Delimiter)>

The main use of query is of course to narrow down a list of nodes to the ones we’re really looking for. You can use a query to find Tokens with a certain action:

>>> for node in tree.query.children.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>

Instead of children, we can use all, which descends into all child contexts:

>>> for node in tree.query.all.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>
<Token ' comment that' at 76:89 (Comment)>

Now it also reaches the token that resides in the Nonsense.comment Context. Let’s find tokens with certain text:

>>> for node in tree.query.all.containing('o'):
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
<Token ' comment that' at 76:89 (Comment)>
<Token 'on' at 95:97 (Text)>

Besides containing(), there are also startingwith() and endingwith(), as well as matching(), which finds tokens matching a regular expression.

The real power of query is to combine things. The following query selects tokens with action Number, but only if they are immediately followed by a Text token:

>>> for node in tree.query.all.action(Text).left.action(Number):
...     print(node)
...
<Token '3' at 16:17 (Literal.Number)>
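
The chaining seen in the queries above can be sketched with plain generators. SketchQuery is a hypothetical miniature and not parce's Query class; it operates on nested Python lists of strings instead of a real tree, and it implements only children, all and containing().

```python
def _walk(node):
    """Yield every descendant of a nested list, depth first."""
    for child in node:
        yield child
        if isinstance(child, list):
            yield from _walk(child)


class SketchQuery:
    """Hypothetical chainable query over nested lists of strings."""

    def __init__(self, make_iter):
        self._make_iter = make_iter  # callable returning a fresh iterator

    def __iter__(self):
        return self._make_iter()

    @property
    def children(self):
        def gen():
            for node in self:
                if isinstance(node, list):
                    yield from node
        return SketchQuery(gen)

    @property
    def all(self):
        def gen():
            for node in self:
                if isinstance(node, list):
                    yield from _walk(node)
        return SketchQuery(gen)

    def containing(self, text):
        def gen():
            for node in self:
                if isinstance(node, str) and text in node:
                    yield node
        return SketchQuery(gen)


tree = ["Some", "text", ["over multiple", '"'], "on"]
query = SketchQuery(lambda: iter([tree]))

print(list(query.children.containing("o")))
print(list(query.all.containing("o")))
```

Every step wraps the previous one in a new object, so nothing is evaluated until the final query is iterated.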

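Queries fall into two kinds: those that navigate through the tree, such as children, all and left, and those that narrow down the result set, such as action() and containing(). The complete lists are in the query module’s documentation.

The special is_not operator inverts the meaning of the next query, e.g.:

n.query.all.is_not.startingwith("text")

Not every query method can be inverted by prepending is_not; see the query module’s documentation for the ones that can.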

For convenience, there are some “endpoint” methods for a query that make it easier in some cases to process the results:

dump()

for debugging, dumps all resulting nodes to standard output

list()

aggregates the result set in a list.

count()

returns the number of nodes in the result set.

pick()

picks the first result, or returns the default if the result set was empty.

pick_last()

exhausts the query generator and returns the last result, or the default if there are no results.

range()

returns the text range as a tuple (pos, end) the result set encompasses
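
The behaviour of some of these endpoints can be approximated with standard Python iterator tools. The helper functions below are hypothetical illustrations that work on any iterable; they are not the Query methods themselves.

```python
from collections import deque


def sketch_pick(results, default=None):
    """Return the first result, or default if there are none."""
    return next(iter(results), default)


def sketch_pick_last(results, default=None):
    """Exhaust the iterable and return the last result, or default."""
    last = deque(results, maxlen=1)  # keeps only the final item
    return last[0] if last else default


def sketch_count(results):
    """Count the results without building a list."""
    return sum(1 for _ in results)


results = ["Some", "text", "with"]
print(sketch_pick(results))
print(sketch_pick_last(results))
print(sketch_count(results))
print(sketch_pick([], default="nothing found"))
```

Note that picking the first result only consumes one item of a generator, while picking the last result or counting has to run it to the end.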

Finally, there is one method that actually changes the tree:

delete()

deletes all selected nodes from their parents. If a context would become empty, that context itself is deleted instead of its children.

Additional information can be found in the query module’s documentation.