Accessing the Tree Structure
When you have parsed text, the result is a tree structure of Tokens, contained by Contexts, which may be nested in other Contexts.
Let’s look at the generated token tree of the simple example of the Getting started section:
>>> tree.dump()
<Context Nonsense.root at 1-108 (19 children)>
├╴<Token 'Some' at 1:5 (Text)>
├╴<Token 'text' at 6:10 (Text)>
├╴<Token 'with' at 11:15 (Text)>
├╴<Token '3' at 16:17 (Literal.Number)>
├╴<Token 'numbers' at 18:25 (Text)>
├╴<Token 'and' at 26:29 (Text)>
├╴<Token '1' at 30:31 (Literal.Number)>
├╴<Token '"' at 32:33 (Literal.String)>
├╴<Context Nonsense.string at 33-67 (2 children)>
│ ├╴<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
│ ╰╴<Token '"' at 66:67 (Literal.String)>
├╴<Token ',' at 67:68 (Delimiter)>
├╴<Token 'and' at 69:72 (Text)>
├╴<Token '1' at 73:74 (Literal.Number)>
├╴<Token '%' at 75:76 (Comment)>
├╴<Context Nonsense.comment at 76-89 (1 child)>
│ ╰╴<Token ' comment that' at 76:89 (Comment)>
├╴<Token 'ends' at 90:94 (Text)>
├╴<Token 'on' at 95:97 (Text)>
├╴<Token 'a' at 98:99 (Text)>
├╴<Token 'newline' at 100:107 (Text)>
╰╴<Token '.' at 107:108 (Delimiter)>
Token
We see that the Token instances represent the matched text. Every Token stores the matched text in its text attribute, its position in the source text in the pos attribute, and the action it was given in the action attribute. Tokens also have an end attribute, which is actually a property that basically returns self.pos + len(self.text).
Although a Token is not a string, you can test for equality:
if token == "bla":
    pass  # do something
Also, you can check if some text is in some Context:
if 'and' in tree:
    pass  # do something if 'and' is in the root context
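The Token behaviour described in this section can be imitated with a few lines of plain Python. The sketch below does not use parce: MiniToken and MiniContext are hypothetical stand-ins, written only to show how a computed end property, comparison with a string, and the in test might fit together.

```python
# Hypothetical stand-in classes; these are NOT parce's own Token and Context.
class MiniToken:
    def __init__(self, pos, text, action):
        self.pos = pos
        self.text = text
        self.action = action

    @property
    def end(self):
        # Computed on demand: pos + len(text).
        return self.pos + len(self.text)

    def __eq__(self, other):
        # Comparing with a plain string compares the token's text.
        if isinstance(other, str):
            return self.text == other
        return self is other

    # Defining __eq__ removes the default hash; restore identity hashing.
    __hash__ = object.__hash__


class MiniContext(list):
    """A plain list of tokens; 'in' works through MiniToken.__eq__."""


tree = MiniContext([
    MiniToken(1, "Some", "Text"),
    MiniToken(26, "and", "Text"),
])
token = tree[0]
print(token.end)        # 5
print(token == "Some")  # True
print("and" in tree)    # True
```

Because the list membership test falls back on ==, no extra code is needed in the stand-in context to support the in test.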
Context
A Context is basically a Python list, and it has the lexicon that created it in the lexicon attribute. The root of the tree is called the root context; it carries the root lexicon. You can access its child contexts and tokens with item or slice notation:
>>> print(tree[2])
<Token 'with' at 11:15 (Text)>
Besides that, Context has pos and end attributes, which refer to the pos value of the first Token in the context and the end value of the last Token in the context (or in a sub-context).
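One way to picture how a context's pos and end follow from its children is the sketch below. It is an illustration only, not parce's code; MiniToken and MiniContext are hypothetical names, and the positions are made up for the example.

```python
# Hypothetical stand-ins, not parce classes.
class MiniToken:
    def __init__(self, pos, text):
        self.pos = pos
        self.text = text

    @property
    def end(self):
        return self.pos + len(self.text)


class MiniContext(list):
    @property
    def pos(self):
        # The first child may itself be a context; its pos recurses.
        return self[0].pos

    @property
    def end(self):
        # The last child may be a sub-context; its end recurses.
        return self[-1].end


nested = MiniContext([MiniToken(6, "text"), MiniToken(11, "with")])
root = MiniContext([MiniToken(1, "Some"), nested])
print(root.pos, root.end)  # 1 15
```

In this sketch the root's end comes from the last token of the nested context, which is what "or a sub-context" refers to.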
Just as a Token can be compared with a string, a Context can be compared with a Lexicon object. So it is possible to write:
>>> tree[8] == Nonsense.string
True
>>> Nonsense.comment in tree
True
A Context other than the root is never empty: if the parser switches to a new lexicon, but that lexicon does not generate any Token, the empty Context is discarded. Only the root context can be empty.
Traversing the tree structure
Both Token and Context have a parent attribute that points to their parent Context. Only for the root context, parent is None.
Token and Context both inherit from Node, which defines a lot of useful methods to traverse the tree structure.
Members of Token
The most important Token methods and attributes:
- Token.text
  The text of this token.
- Token.action
  The action specified by the lexicon rule that created the token.
- Token.forward_including(upto=None)
  Yield all tokens in forward direction, including self.
- Token.forward_until_including(other)
  Yield all tokens starting with this token, up to and including the other.
- Token.backward_including(upto=None)
  Yield all tokens in backward direction, including self.
Members of Context
Context builds on the Python list builtin, so it has all the methods list provides. Some of the additional methods and attributes it provides are:
- Context.lexicon
  The lexicon this context was instantiated with.
- Context.tokens(reverse=False)
  Yield all Tokens, descending into nested Contexts. If reverse is set to True, yield all tokens in backward direction.
- Context.first_token()
  Return our first Token.
- Context.last_token()
  Return our last Token.
- Context.find_token(pos)
  Return the Token at or to the right of position. Returns None if there is no such token.
- Context.find_token_left(pos)
  Return the Token at or to the left of position. Returns None if there is no such token.
- Context.find_token_after(pos)
  Return the first token completely right from pos. Returns None if there is no token right from pos.
- Context.find_token_before(pos)
  Return the last token completely left from pos. Returns None if there is no token left from pos.
- Context.range(start=0, end=None)
  Return a Range. The ancestor of the range is the common ancestor of the tokens found at start and end (or the context itself if start or end fall outside this context). If start is 0 and end is None, the range encompasses the full context. Returns None if this context is empty.
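The position lookups listed above can be pictured as a binary search over token boundaries. The sketch below works on a flat, sorted list of (pos, text) tuples and is not parce's implementation, which searches the nested tree. The function names mirror two of the methods, but the exact boundary handling shown here is an assumption.

```python
from bisect import bisect_right

# Flat stand-in for a token list: (pos, text), sorted by position.
tokens = [(1, "Some"), (6, "text"), (11, "with")]
ends = [pos + len(text) for pos, text in tokens]  # [5, 10, 15]


def find_token(pos):
    """Return the token at or to the right of pos, else None.

    Assumption: a token counts as "at" pos when pos < its end.
    """
    i = bisect_right(ends, pos)  # index of first token whose end > pos
    return tokens[i] if i < len(tokens) else None


def find_token_before(pos):
    """Return the last token that ends at or before pos, else None."""
    i = bisect_right(ends, pos)  # number of tokens with end <= pos
    return tokens[i - 1] if i > 0 else None


print(find_token(7))          # (6, 'text')
print(find_token(15))         # None
print(find_token_before(10))  # (6, 'text')
print(find_token_before(3))   # None
```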
Often, when dealing with the tree structure, you want to know whether a node is a Token or a Context. Instead of calling:
if isinstance(node, parce.tree.Token):
    do_something()
two read-only attributes are available, is_token and is_context. The first is only and always true in Token instances, the other in Context instances:
if node.is_token:
    do_something()
Grouped Tokens
When a dynamic action is used in a rule, and it generates more than one Token from the same regular expression match, these Tokens form a group, each having their index in the group in the group attribute. That attribute is read-only and None for normal Tokens. GroupToken instances are always adjacent and in the same Context, and the last index is negative, to indicate it is the last.
Normally you don’t have to do much with this information, but parce needs it: the whole group was created from one regular expression match, so when you edit a text, parce can’t start reparsing at a token that is not the first of its group.
But just in case, if you want to be sure you have the first member of a Token group:
if token.group:
    # group is not None or 0
    token = token.get_group_start()
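To make the group indices concrete, the sketch below finds the first member of a group in a flat list of siblings. It is not parce's code. It assumes that a three-token group carries the indices 0, 1 and -2 (the last index negated); MiniToken and group_start() are hypothetical names.

```python
# Hypothetical stand-in; the index scheme (0, 1, -2) is an assumption.
class MiniToken:
    def __init__(self, text, group=None):
        self.text = text
        self.group = group  # None for a normal token


siblings = [
    MiniToken("x"),
    MiniToken("a", 0),   # first of the group
    MiniToken("b", 1),
    MiniToken("c", -2),  # last of the group: negative index
    MiniToken("y"),
]


def group_start(siblings, index):
    """Return the list index of the first token of the group at index."""
    group = siblings[index].group
    if not group:
        # None (not in a group) or 0 (already the first member).
        return index
    # abs(group) is the distance to the first member.
    return index - abs(group)


print(siblings[group_start(siblings, 3)].text)  # a
print(siblings[group_start(siblings, 4)].text)  # y
```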
Querying the tree structure
Besides the various find methods, there is another powerful way to search for Tokens and Contexts in the tree: the query property of every Token or Context.
The query property of both Token and Context returns a Query object, which is a generator initially yielding just that Token or Context:
>>> for node in tree.query:
...     print(node)
...
<Context Nonsense.root at 1-108 (19 children)>
But the Query object has powerful methods that modify the stream of nodes yielded by the generator. All these methods return a new Query object, so queries can be chained in an XPath-like fashion. For example:
>>> for node in tree.query[:3]:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>
The [:3] operator picks the first three child nodes of every node yielded by the previous generator. You can use [:] or .children to get all children of every node:
>>> for node in tree.query.children:
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'text' at 6:10 (Text)>
<Token 'with' at 11:15 (Text)>
<Token '3' at 16:17 (Literal.Number)>
<Token 'numbers' at 18:25 (Text)>
<Token 'and' at 26:29 (Text)>
<Token '1' at 30:31 (Literal.Number)>
<Token '"' at 32:33 (Literal.String)>
<Context Nonsense.string at 33-67 (2 children)>
<Token ',' at 67:68 (Delimiter)>
<Token 'and' at 69:72 (Text)>
<Token '1' at 73:74 (Literal.Number)>
<Token '%' at 75:76 (Comment)>
<Context Nonsense.comment at 76-89 (1 child)>
<Token 'ends' at 90:94 (Text)>
<Token 'on' at 95:97 (Text)>
<Token 'a' at 98:99 (Text)>
<Token 'newline' at 100:107 (Text)>
<Token '.' at 107:108 (Delimiter)>
The main use of query is of course to narrow down a list of nodes to the ones we’re really looking for. You can use a query to find Tokens with a certain action:
>>> for node in tree.query.children.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>
Instead of children, we can use all, which descends into all child contexts:
>>> for node in tree.query.all.action(Comment):
...     print(node)
...
<Token '%' at 75:76 (Comment)>
<Token ' comment that' at 76:89 (Comment)>
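The difference between children and all, and the way each step returns a new query, can be imitated with chained generators. The sketch below is a simplified stand-in and not parce's Query class; MiniQuery, MiniToken and MiniContext are hypothetical, and actions are plain strings here.

```python
# Hypothetical stand-ins that only mimic the chaining idea.
class MiniToken:
    def __init__(self, text, action):
        self.text = text
        self.action = action


class MiniContext(list):
    pass


class MiniQuery:
    def __init__(self, source):
        self._source = source  # callable returning an iterable of nodes

    def __iter__(self):
        return iter(self._source())

    @property
    def children(self):
        def gen():
            for node in self:
                if isinstance(node, MiniContext):
                    yield from node  # direct children only
        return MiniQuery(gen)

    @property
    def all(self):
        def walk(context):
            for child in context:
                yield child
                if isinstance(child, MiniContext):
                    yield from walk(child)  # descend into sub-contexts

        def gen():
            for node in self:
                if isinstance(node, MiniContext):
                    yield from walk(node)
        return MiniQuery(gen)

    def action(self, action):
        def gen():
            for node in self:
                if isinstance(node, MiniToken) and node.action == action:
                    yield node
        return MiniQuery(gen)


tree = MiniContext([
    MiniToken("%", "Comment"),
    MiniContext([MiniToken(" comment that", "Comment")]),
    MiniToken("ends", "Text"),
])
query = MiniQuery(lambda: [tree])
shallow = [t.text for t in query.children.action("Comment")]
deep = [t.text for t in query.all.action("Comment")]
print(shallow)  # ['%']
print(deep)     # ['%', ' comment that']
```

Each step wraps the previous one in a new generator, so nothing is evaluated until the final query is iterated.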
Now it also reaches the token that resides in the Nonsense.comment Context. Let’s find tokens with certain text:
>>> for node in tree.query.all.containing('o'):
...     print(node)
...
<Token 'Some' at 1:5 (Text)>
<Token 'string inside\nover multiple '... at 33:66 (Literal.String)>
<Token ' comment that' at 76:89 (Comment)>
<Token 'on' at 95:97 (Text)>
Besides containing(), we also have startingwith(), endingwith() and matching(), the last of which finds tokens matching a regular expression.
The real power of query is to combine things. The following query selects tokens with action Number, but only if they are immediately followed by a Text token:
>>> for node in tree.query.all.action(Text).left.action(Number):
...     print(node)
...
<Token '3' at 16:17 (Literal.Number)>
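Read step by step, the combined query first selects the Text tokens, then moves to each one's left neighbour, then keeps only the neighbours that are numbers. The plain-Python sketch below replays those three steps on a flat list of (text, action) tuples. It does not use parce, and the string action names are stand-ins.

```python
# A flat excerpt of the example tokens as (text, action) tuples.
tokens = [
    ("with", "Text"),
    ("3", "Number"),
    ("numbers", "Text"),
    ("and", "Text"),
    ("1", "Number"),
    ('"', "String"),
]

# Step 1, like .all.action(Text): positions of all Text tokens.
text_indexes = [i for i, (_, action) in enumerate(tokens) if action == "Text"]

# Step 2, like .left: the left neighbour of each of them, if any.
left_neighbours = [tokens[i - 1] for i in text_indexes if i > 0]

# Step 3, like .action(Number): keep only the numbers.
result = [tok for tok in left_neighbours if tok[1] == "Number"]
print(result)  # [('3', 'Number')]
```

The second number is dropped because its right neighbour is a string delimiter, not a Text token.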
Here is a list of all the queries that navigate: all, children, parent, ancestors, next, previous, forward, backward, right, left, right_siblings, left_siblings, [n], [n:m], first, last, and map().
And this is a list of the queries that narrow down the result set: tokens, contexts, uniq, remove_ancestors, remove_descendants, slice() and filter().
The special is_not operator inverts the meaning of the next query, e.g.:
n.query.all.is_not.startingwith("text")
The following query methods can be inverted by prepending is_not: len(), in_range(), (lexicon), (lexicon, lexicon2, ...), ("text"), ("text", "text2", ...), startingwith(), endingwith(), containing(), matching(), action() and in_action().
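One possible way to build such an inverting operator is to have is_not return a query that remembers to flip the next predicate. The sketch below shows that idea on plain strings. It is a guess at the pattern, not parce's implementation, and MiniQuery is a hypothetical class.

```python
# Hypothetical sketch of an inverting operator; not parce's Query.
class MiniQuery:
    def __init__(self, texts, invert=False):
        self._texts = list(texts)
        self._invert = invert  # True: the next predicate is negated

    def __iter__(self):
        return iter(self._texts)

    @property
    def is_not(self):
        # Same items, but flag the following filter for inversion.
        return MiniQuery(self._texts, invert=True)

    def startingwith(self, prefix):
        # Keep an item when the test result differs from the invert flag.
        keep = [t for t in self._texts if t.startswith(prefix) != self._invert]
        return MiniQuery(keep)  # the flag is reset after one use


words = ["text", "textual", "with", "numbers"]
print(list(MiniQuery(words).startingwith("text")))         # ['text', 'textual']
print(list(MiniQuery(words).is_not.startingwith("text")))  # ['with', 'numbers']
```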
For convenience, there are some “endpoint” methods for a query that make it easier in some cases to process the results:
- dump()
  for debugging, dumps all resulting nodes to standard output.
- count()
  returns the number of nodes in the result set.
- pick()
  picks the first result, or returns the default if the result set was empty.
- pick_last()
  exhausts the query generator and returns the last result, or the default if there are no results.
- range()
  returns the text range the result set encompasses, as a tuple (pos, end).
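Since a query is a generator, endpoints such as these amount to consuming an iterator in different ways. The standalone functions below sketch what pick, pick_last and count might look like for any iterable. They are illustrative helpers, not parce's methods, and the default parameter is an assumption.

```python
from collections import deque


def pick(results, default=None):
    """Return the first result without consuming the rest."""
    return next(iter(results), default)


def pick_last(results, default=None):
    """Exhaust the iterable and return the last result."""
    last = deque(results, maxlen=1)  # keeps only the final item
    return last[0] if last else default


def count(results):
    """Count the results by exhausting the iterable."""
    return sum(1 for _ in results)


def numbers():
    # A small generator standing in for a query.
    yield from ("3", "1", "1")


print(pick(numbers()))        # 3
print(pick_last(numbers()))   # 1
print(count(numbers()))       # 3
print(pick(iter(()), "n/a"))  # n/a
```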
Finally, there is one method that actually changes the tree:
- delete()
  deletes all selected nodes from their parents. If a context would become empty, it is deleted as well, instead of its children.
Additional information can be found in the query module’s documentation.