Accessing the Tree Structure

When you have parsed text, the result is a tree structure of Tokens, contained by Contexts, which may be nested in other Contexts.

Let’s look at the generated token tree of the simple example from the Getting started section:

>>> tree.dump()
<Context Nonsense.root at 1-108 (19 children)>
 ├╴<Token 'Some' at 1:5 (Text)>
 ├╴<Token 'text' at 6:10 (Text)>
 ├╴<Token 'with' at 11:15 (Text)>
 ├╴<Token '3' at 16:17 (Literal.Number)>
 ├╴<Token 'numbers' at 18:25 (Text)>
 ├╴<Token 'and' at 26:29 (Text)>
 ├╴<Token '1' at 30:31 (Literal.Number)>
 ├╴<Token '"' at 32:33 (Literal.String)>
 ├╴<Context Nonsense.string at 33-67 (2 children)>
 │  ├╴<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
 │  ╰╴<Token '"' at 66:67 (Literal.String)>
 ├╴<Token ',' at 67:68 (Delimiter)>
 ├╴<Token 'and' at 69:72 (Text)>
 ├╴<Token '1' at 73:74 (Literal.Number)>
 ├╴<Token '%' at 75:76 (Comment)>
 ├╴<Context Nonsense.comment at 76-89 (1 child)>
 │  ╰╴<Token ' comment that' at 76:89 (Comment)>
 ├╴<Token 'ends' at 90:94 (Text)>
 ├╴<Token 'on' at 95:97 (Text)>
 ├╴<Token 'a' at 98:99 (Text)>
 ├╴<Token 'newline' at 100:107 (Text)>
 ╰╴<Token '.' at 107:108 (Delimiter)>

Token

We see that the Token instances represent the matched text. Every Token has the matched text in its text attribute, its position in the source text in the pos attribute, and the action it was given in the action attribute. Tokens also have an end attribute, which is a property that returns self.pos + len(self.text).
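
For example, picking the number Token from the tree dumped above, we get something like:

>>> token = tree[3]
>>> token.text
'3'
>>> token.pos, token.end
(16, 17)
>>> token.action
Literal.Number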

Although a Token is not a string, you can test for equality:

if token == "bla":
    # do something

Also, you can check if some text is in some Context:

if 'and' in tree:
    # do something if 'and' is in the root context.
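
For example, with the tree from above:

>>> 'and' in tree
True
>>> 'foo' in tree
False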

Context

A Context is basically a Python list, and it has the lexicon that created it in its lexicon attribute. The root of the tree is called the root context; it carries the root lexicon. You can access its child contexts and tokens with item or slice notation:

>>> print(tree[2])
<Token 'with' at 11:15 (Text)>

Besides that, Context has pos and end attributes, which refer to the pos value of the first Token in the context and the end value of its last Token or sub-context.
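
For example, the string sub-context in the tree above takes its pos from its first Token and its end from its last Token:

>>> print(tree[8])
<Context Nonsense.string at 33-67 (2 children)>
>>> tree[8].pos, tree[8].end
(33, 67)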

Just like it is possible to compare a Token with a string, a Context can be compared to a Lexicon object. So it is possible to write:

>>> tree[8] == Nonsense.string
True
>>> Nonsense.comment in tree
True

A Context is never empty: if the parser switches to a new lexicon, but the lexicon does not generate any Token, the empty Context is discarded. Only the root context can be empty.

Traversing the tree structure

Both Token and Context have a parent attribute that points to their parent Context. Only for the root context is parent None.
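
For example, every top-level node of the tree above has the root context as its parent:

>>> print(tree[17].parent)
<Context Nonsense.root at 1-108 (19 children)>
>>> print(tree.parent)
None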

Token and Context both inherit Node, which defines a lot of useful methods to traverse the tree structure.

Members shared by Token and Context

These are the most important attributes Token and Context both provide:

Node.parent

The parent Context; the root context has parent None. (A weak reference is used: if you don’t keep a reference to the parent yourself, it can be garbage collected.)

Node.pos, Node.end

The starting and ending position, respectively, of this node in the source text (for Token these are direct attributes; for Context they are derived from the first and last descendant Token)

Node.is_token

False for Context, True for Token

Node.is_context

True for Context, False for Token

These are the most important methods Token and Context both provide (a short example follows the list):

Node.parent_index()[source]

Return our index in the parent.

This is recommended over using parent.index(self), because this method finds our index using a binary search on position, while the latter performs a linear search, which is slower with a large number of children.

Node.root()[source]

Return the root node.

Node.is_first()[source]

Return True if this Node is the first child of its parent.

Fails if called on the root node.

Node.is_last()[source]

Return True if this Node is the last child of its parent.

Fails if called on the root node.

Node.ancestors(upto=None)[source]

Climb the tree up over the parents.

If upto is given and it is one of the ancestors, stop after yielding that ancestor. Otherwise iteration stops at the root node.

Node.is_ancestor_of(node)[source]

Return True if this Node is an ancestor of the other Node.

Node.left_sibling()[source]

Return the left sibling of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

Node.left_siblings()[source]

Yield the left siblings of this node in reverse order, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

Node.right_sibling()[source]

Return the right sibling of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

Node.right_siblings()[source]

Yield the right siblings of this node, if any.

Does not descend into child nodes or ascend to the parent. Fails if called on the root node.

Node.next_token()[source]

Return the following Token, if any.

Node.previous_token()[source]

Return the preceding Token, if any.

Node.backward(upto=None)[source]

Yield all Tokens in backward direction, starting at the left sibling.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.

Node.forward(upto=None)[source]

Yield all Tokens in forward direction, starting at the right sibling.

Descends into child Contexts, and ascends into parent Contexts. If upto is given, does not ascend above that context.

property Node.query

Query this node in different ways; see the query module.

Node.copy(parent=None)[source]

Return a copy of the Node, but with the specified parent.
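
As a small example with the tree from above, starting at the Token for the opening quote of the string, some of these methods can be used like this:

>>> quote = tree[7]
>>> print(quote)
<Token '"' at 32:33 (Literal.String)>
>>> quote.parent_index()
7
>>> quote.is_first()
False
>>> print(quote.right_sibling())
<Context Nonsense.string at 33-67 (2 children)>
>>> print(quote.next_token())
<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>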

Members of Token

The most important Token methods and attributes (a short example follows the list):

Token.text

The text of this token

Token.action

The action specified by the lexicon rule that created the token

Token.forward_including(upto=None)[source]

Yield all tokens in forward direction, including self.

Token.forward_until_including(other)[source]

Yield all tokens starting with this token, up to and including the other.

Token.backward_including(upto=None)[source]

Yield all tokens in backward direction, including self.

Token.range(other)[source]

Return a Range from this token up to and including the other.

Returns None if the other Token does not belong to the same tree.
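
For example, forward_until_including() yields a contiguous run of tokens from the example tree:

>>> start, stop = tree[5], tree[7]
>>> for token in start.forward_until_including(stop):
...     print(token)
...
<Token 'and' at 26:29 (Text)>
<Token '1' at 30:31 (Literal.Number)>
<Token '"' at 32:33 (Literal.String)>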

Members of Context

Context builds on the Python list builtin, so it has all the methods list() provides. Some of the additional methods and attributes it provides are (a short example follows the list):

Context.lexicon

The lexicon this context was instantiated with.

Context.tokens(reverse=False)[source]

Yield all Tokens, descending into nested Contexts.

If reverse is set to True, yield all tokens in backward direction.

Context.first_token()[source]

Return our first Token.

Context.last_token()[source]

Return our last Token.

Context.find_token(pos)[source]

Return the Token at or to the right of position pos.

Returns None if there is no such token.

Context.find_token_left(pos)[source]

Return the Token at or to the left of position pos.

Returns None if there is no such token.

Context.find_token_after(pos)[source]

Return the first token completely to the right of pos.

Returns None if there is no such token.

Context.find_token_before(pos)[source]

Return the last token completely to the left of pos.

Returns None if there is no such token.

Context.range(start=0, end=None)[source]

Return a Range.

The ancestor of the range is the common ancestor of the tokens found at start and end (or the context itself if start or end fall outside this context). If start is 0 and end is None, the range encompasses the full context.

Returns None if this context is empty.
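
A short example with the root context of the tree above:

>>> tree.lexicon == Nonsense.root
True
>>> print(tree.first_token())
<Token 'Some' at 1:5 (Text)>
>>> print(tree.last_token())
<Token '.' at 107:108 (Delimiter)>
>>> print(tree.find_token(20))
<Token 'numbers' at 18:25 (Text)>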

Often, when dealing with the tree structure, you want to know whether a node is a Token or a Context. Instead of calling:

if isinstance(node, parce.tree.Token):
    do_something()

two read-only attributes are available, is_token and is_context. The first is True only (and always) for Token instances, the second only for Context instances:

if node.is_token:
    do_something()

Grouped Tokens

When a dynamic action is used in a rule and it generates more than one Token from the same regular expression match, these Tokens form a group, each one carrying its index within the group in its group attribute. That attribute is read-only and None for normal Tokens. GroupToken instances are always adjacent and in the same Context, and the index of the last member is negative, to indicate it is the last one.

Normally you don’t have to do much with this information, but parce needs it: when a text is edited, reparsing cannot start at a token that is not the first of its group, because the whole group was created from one regular expression match.

But just in case, if you want to be sure you have the first member of a Token group:

if token.group:
    # group is neither None nor 0, so this is not the first token of its group
    token = token.get_group_start()

Querying the tree structure

Besides the various find methods, there is another powerful way to search for Tokens and Contexts in the tree: the query property of every Token or Context.

The query property of both Token and Context returns a Query object which is a generator initially yielding just that Token or Context:

>>> for node in tree.query:
...     print(node)
...
<Context Nonsense.root at 1-108 (19 children)>

But the Query object has powerful methods that modify the stream of nodes yielded by the generator. All these methods return a new Query object, so queries can be chained in an XPath-like fashion. For example:

>>> for node in tree.query[:3]:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>

The [:3] operator picks the first three children of every node yielded by the previous generator. You can use [:] or .children to get all children of every node:

>>> for node in tree.query.children:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>
<Token '3' at 16:17 (Literal.Number)>
<Token 'numbers' at 18:25 (Text)>
<Token 'and' at 26:29 (Text)>
<Token '1' at 30:31 (Literal.Number)>
<Token '"' at 32:33 (Literal.String)>
<Context Nonsense.string at 33-67 (2 children)>
<Token ',' at 67:68 (Delimiter)>
<Token 'and' at 69:72 (Text)>
<Token '1' at 73:74 (Literal.Number)>
<Token '%' at 75:76 (Comment)>
<Context Nonsense.comment at 76-89 (1 child)>
<Token 'ends' at 90:94 (Text)>
<Token 'on' at 95:97 (Text)>
<Token 'a' at 98:99 (Text)>
<Token 'newline' at 100:107 (Text)>
<Token '.' at 107:108 (Delimiter)>

The main use of query is of course to narrow down a list of nodes to the ones we’re really looking for. You can use a query to find Tokens with a certain action:

>>> for node in tree.query.children.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>

Instead of children, we can use all, which descends into all child contexts:

>>> for node in tree.query.all.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>
<Token ' comment that' at 76:89 (Comment)>

Now it also reaches the token that resides in the Nonsense.comment Context. Let’s find tokens with certain text:

>>> for node in tree.query.all.containing('o'):
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
<Token ' comment that' at 76:89 (Comment)>
<Token 'on' at 95:97 (Text)>

Besides containing(), we also have startingwith(), endingwith() and matching(), which finds tokens whose text matches a regular expression.
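
For example, only one token in the example text starts with 'new':

>>> for node in tree.query.all.startingwith('new'):
...     print(node)
...
<Token 'newline' at 100:107 (Text)>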

The real power of query is to combine things. The following query selects tokens with action Number, but only if they are immediately followed by a Text token:

>>> for node in tree.query.all.action(Text).left.action(Number):
...     print(node)
...
<Token '3' at 16:17 (Literal.Number)>

Here is a list of all the queries that navigate:

all, children, parent, ancestors, next, previous, forward, backward, right, left, right_siblings, left_siblings, [n], [n:m], first, last, and map().

And this is a list of the queries that narrow down the result set:

tokens, contexts, uniq, remove_ancestors, remove_descendants, slice() and filter().

The special is_not operator inverts the meaning of the next query, e.g.:

n.query.all.is_not.startingwith("text")

The following query methods can be inverted by prepending is_not:

len(), in_range(), (lexicon), (lexicon, lexicon2, ...), ("text"), ("text", "text2", ...), startingwith(), endingwith(), containing(), matching(), action() and in_action().

For convenience, there are some “endpoint” methods for a query that make it easier in some cases to process the results (a short example follows the list):

dump()

for debugging, dumps all resulting nodes to standard output.

count()

returns the number of nodes in the result set.

pick()

picks the first result, or returns the default if the result set was empty.

pick_last()

exhausts the query generator and returns the last result, or the default if there are no results.

range()

returns the text range the result set encompasses, as a (pos, end) tuple.
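
For example, reusing the Comment query from above, these endpoints would give something like:

>>> print(tree.query.all.action(Comment).pick())
<Token '%' at 75:76 (Comment)>
>>> tree.query.all.action(Comment).count()
2
>>> tree.query.all.action(Comment).range()
(75, 89)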

Finally, there is one method that actually changes the tree:

delete()

deletes all selected nodes from their parents. If a Context would become empty, the Context itself is deleted instead of its children.

Additional information can be found in the query module’s documentation.