The treebuilder module
This module defines classes and functions to build a tree structure by parsing a text string.
To get the tree of tokens using a particular root lexicon from a string of text, use build_tree().
A more advanced approach is using the TreeBuilder, which can build a tree in one go as well, but is also capable of updating an existing tree when the text changes at a particular position, e.g. while typing in a text editor. In this case, tokens in front of the modified region are reused (carefully checking whether changes affect earlier regions), and tokens at the end of the modified region are also reused if they have the same context ancestry.
TreeBuilder also reports the start and end position of the updated region, and the lexicons that were left open at the end, which in some languages can mean that a document or a certain structure is incomplete.
The TreeBuilder is designed so that it is possible to perform tokenizing in a background thread, and even interrupt tokenizing when changes are to be applied while processing previous changes.
-
class TreeBuilder(root_lexicon=None)
Bases: parce.util.Observable
Build a tree from parsing the text.
The root node of the tree is in the root instance attribute. This root context is never replaced, although its lexicon and its children may change.
Call rebuild() to build or rebuild the tree. This method stores the desired changes to the tree and calls start_processing(), which can be re-implemented to support asynchronous tree building.
The actual building of a tree happens in build_new_tree(), which builds a (replacement) tree without making any changes yet to the current tree.
The result of build_new_tree() is a tuple of arguments that is used when calling replace_tree(), which integrates the updated subtree in the main tree structure. This method sets three instance attributes:
start, end: indicate the region in which the tokens were changed. After build(), start is always 0 and end = len(text), but after rebuild(), these values indicate the range that was actually re-tokenized.
lexicons: the list of open lexicons (excluding the root lexicon) at the end of the document. This way you can see in which lexicon parsing ended.
If a tree was rebuilt, and old tail tokens were reused, the lexicons variable is not set, meaning that the old value is still valid. If the TreeBuilder was not used before, lexicons is an empty tuple.
No other variables or state are kept, so if you don’t need the above information anymore, you can throw away the TreeBuilder after use.
During the building process, the TreeBuilder emits certain events you can subscribe to, using the connect() method provided by the Observable class that’s mixed into this TreeBuilder class.
The following events are emitted, with the following arguments:
"started": emitted when a (re)build starts; the handler is called without arguments
"replace": emitted just before the tree actually changes (while the new tree is being built, the tree is still unchanged and accessible, but between the "replace" and "finished" events the tree is in an inconsistent state)
"finished": emitted when a (re)build has finished; the handler is called without arguments
"updated": emitted when a (re)build has finished; the handler is called with two arguments: start, end, that denote the updated range
"peek": emitted by the default implementation of the peek() method; the handler is called with two arguments: start, tree
"invalidate": emitted by the default implementation of the invalidate_context() method; the handler is called with the Context that needs to be invalidated
For example, to get notified when a build process starts:
>>> b = TreeBuilder(MyLang.root)
>>> def hi_there():
...     print("started")
...
>>> b.connect("started", hi_there)
>>> b.rebuild("some boring text")
started
>>>
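The event mechanism can also be tried out without parce installed. What follows is only a minimal sketch of an observer mix-in: MiniObservable, its emit() method and FakeBuilder are hypothetical stand-ins, not the actual parce.util.Observable implementation. Only connect() and the event names are taken from the description above.

```python
from collections import defaultdict


class MiniObservable:
    """Hypothetical stand-in for an observer mix-in (not parce's own class)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def connect(self, event, handler):
        """Register handler to be called when event is emitted."""
        self._handlers[event].append(handler)

    def emit(self, event, *args):
        """Call every handler connected to event (assumed helper name)."""
        for handler in self._handlers[event]:
            handler(*args)


class FakeBuilder(MiniObservable):
    """Toy builder that only emits events; it does not parse anything."""

    def rebuild(self, text):
        self.emit("started")
        # A real builder would tokenize here; we pretend everything changed.
        self.emit("updated", 0, len(text))


log = []
b = FakeBuilder()
b.connect("started", lambda: log.append("started"))
b.connect("updated", lambda start, end: log.append((start, end)))
b.rebuild("some boring text")
```

Handlers for "started" take no arguments, while handlers for "updated" receive the start and end of the updated range, matching the event list above.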
-
start = 0
-
end = 0
-
lexicons = ()
-
rebuild(text, root_lexicon=False, start=0, removed=0, added=None)
Tokenize the modified part of the text again and update the tree.
The arguments:
text
The text to parse. Always give the entire text, even when only a small part was actually changed. The tree builder needs to check the text before and after the changed region, and possibly re-parse more text.
root_lexicon
The root lexicon to use (default: False). False means no change; can be None or any Lexicon. If not False, the tree is always rebuilt completely.
start
Position of the change (default: 0)
removed
The number of removed characters (default: 0)
added
The number of added characters (default: None, which means all characters from start to the end of the text)
Calls build_new_tree() and replace_tree() to do the actual work.
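An editor that only has the old and the new text needs to derive the start, removed and added values itself. The following sketch shows one possible way, by stripping the common prefix and suffix of both strings. The compute_change() helper is hypothetical and not part of this module.

```python
def compute_change(old, new):
    """Return (start, removed, added) describing how old became new.

    Hypothetical helper: strips the common prefix and suffix of both strings.
    """
    # Length of the common prefix.
    start = 0
    limit = min(len(old), len(new))
    while start < limit and old[start] == new[start]:
        start += 1
    # Length of the common suffix, not overlapping the prefix.
    tail = 0
    while tail < limit - start and old[-1 - tail] == new[-1 - tail]:
        tail += 1
    removed = len(old) - start - tail
    added = len(new) - start - tail
    return start, removed, added


old = "print(1 + 2)"
new = "print(1 + 20)"
start, removed, added = compute_change(old, new)
# With a real builder, these values would be passed along with the full text:
# builder.rebuild(new, False, start, removed, added)
```

Here a single character was inserted at position 11, so start is 11, removed is 0 and added is 1.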
-
add_changes(text, root_lexicon, start, removed, added)
Add the changes to our changes list, but do not rebuild immediately.
The arguments are the same as for rebuild().
-
build_new_tree(text, root_lexicon, start, removed, added)
Build a new tree without yet modifying the current tree.
This method is called by process(). Returns a BuildResult five-tuple with tree, start, end, offset and lexicons values. The start and end are the insert positions in the old tree.
Tokens from the current tree are reused as much as possible. For reused tokens at the tail (after the end of the modified region), the pos attribute is updated if necessary.
The new tree is intended to replace a part of, or the whole old tree. If start == 0 and lexicons is not None, the whole tree can be replaced. (In this case, the root lexicon might have changed!) Use replace_tree() to insert the result tree in the old tree.
If start > 0, tokens in the old tree before start are to be preserved.
If lexicons is None, old tail tokens after end must be reused, and the old list of open lexicons is still relevant. The offset then gives the position change for the tokens that are reused.
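The three rules above can be summarized in a few lines of code. This is an illustrative sketch only: the BuildResult below is a plain namedtuple stand-in that reuses the five field names listed above, and describe() is a hypothetical helper, not something this module provides.

```python
from collections import namedtuple

# Stand-in with the five field names from the text; not parce's own class.
BuildResult = namedtuple("BuildResult", "tree start end offset lexicons")


def describe(result):
    """List what a build result asks the caller to do (illustrative only)."""
    actions = []
    if result.start == 0 and result.lexicons is not None:
        # Nothing of the old tree is kept.
        actions.append("replace whole tree")
    if result.start > 0:
        # Old tokens before start stay in place.
        actions.append("keep tokens before start")
    if result.lexicons is None:
        # Old tail tokens are reused; their positions shift by offset.
        actions.append("reuse tail after end, shift pos by {}".format(result.offset))
    return actions


whole = BuildResult(tree=[], start=0, end=0, offset=0, lexicons=())
tail = BuildResult(tree=[], start=5, end=9, offset=2, lexicons=None)
head = BuildResult(tree=[], start=5, end=9, offset=0, lexicons=())
```

Note that the second and third conditions are not exclusive: a result with start > 0 and lexicons set to None keeps both the head and the tail of the old tree.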
-
replace_tree(result)
Modify the tree using the result from build_new_tree().
In most types of GUI applications, this method should be called in the main (GUI) thread.
The changes are delegated to the various replace_ methods, which can be reimplemented to get fine-grained monitoring of and control over the tree-replacing process.
Additionally, this method calls invalidate_context() with the youngest Context that had children removed or added.
-
replace_nodes(context, slice_, nodes)
Replace the context’s slice with new nodes.
This method is called by replace_tree(). You can reimplement this method to notify others of the change.
-
replace_root_lexicon(lexicon)
Set the root lexicon.
This method is called by replace_tree(). You can reimplement this method to notify others of the change.
-
replace_pos(context, index, offset)
Adjust the pos attribute of all tokens in context[index:].
This method is called by replace_tree(). You can reimplement this method to notify others of the change.
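The position adjustment can be pictured with a toy example. The sketch below assumes a flat list of tokens and ignores nested contexts. The Token class and shift_positions() are hypothetical and much simpler than the real tree classes.

```python
class Token:
    """Toy token with only a position and text (not parce's Token)."""

    def __init__(self, pos, text):
        self.pos = pos
        self.text = text


def shift_positions(context, index, offset):
    """Add offset to the pos of every token in context[index:]."""
    for token in context[index:]:
        token.pos += offset


# Tokens for "a = 1"; inserting two characters before "=" moves the tail.
context = [Token(0, "a"), Token(2, "="), Token(4, "1")]
shift_positions(context, 1, 2)
positions = [t.pos for t in context]
```

The first token keeps its position, while the two tokens from index 1 onward move two characters to the right.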
-
invalidate_context(context)
Called with the youngest Context that had children removed or added.
This means that the meaning of this context has probably changed, which matters for example when you transform the context to some other data structure, and that the ancestors also need to be invalidated.
The default implementation of this method emits the invalidate event, see connect().
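As an illustration of why the ancestors are affected too, consider a cache of data derived from contexts. The sketch below is one possible way to react to an invalidation, using a hypothetical Node class with a parent link and a plain dict as cache. None of these names come from this module.

```python
class Node:
    """Toy context with a parent link (hypothetical, not parce's Context)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def invalidate(cache, context):
    """Drop cached results for context and all of its ancestors."""
    dropped = []
    node = context
    while node is not None:
        if cache.pop(node, None) is not None:
            dropped.append(node.name)
        node = node.parent
    return dropped


root = Node("root")
block = Node("block", root)
string = Node("string", block)
other = Node("other", root)

# Pretend some derived data was cached for every context.
cache = {root: "R", block: "B", string: "S", other: "O"}
dropped = invalidate(cache, block)
remaining = sorted(node.name for node in cache)
```

Only the invalidated context and its ancestors lose their cached data; sibling and child contexts keep theirs.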
-
get_changes()
Get and combine the stored change requests in a Changes object.
This may only be called from the same thread that also performs the rebuild().
-
start_processing()
Called when there are recorded changes to process.
The default implementation reads all build stages from the process() generator until it is exhausted. You can reimplement this method to run that generator e.g. in a background thread.
-
process()
Process all changes and update the tree.
This method behaves as a generator coroutine. Instead of simply calling this method, you should iterate over its output, which reports which stage the process is at.
Yields “build” when about to build a new tree; “replace” when about to replace the tree (which can be repeated); “finish” when finished looping; and “done” at the very end.
When re-implementing start_processing(), you can decide which stages are to be run in a background thread and which in a main (GUI) thread.
You should exhaust the generator fully.
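The stage-by-stage iteration can be sketched with a toy generator. In the example below, fake_process() merely yields the stage names mentioned above and run() is a hypothetical driver. It only shows dispatching on the stage name, under the assumption that just the "build" stage is run in the background. A threaded version would resume the generator from the chosen thread instead of appending to a list.

```python
def fake_process():
    """Toy generator yielding stage names like those described above."""
    yield "build"
    yield "replace"
    yield "finish"
    yield "done"


def run(process, in_background, in_main):
    """Exhaust the generator, dispatching each announced stage by name."""
    for stage in process:
        # Assumption: only the "build" stage may run off the main thread.
        if stage == "build":
            in_background(stage)
        else:
            in_main(stage)


background, main = [], []
run(fake_process(), background.append, main.append)
```

Because the for loop runs until the generator stops, the generator is always exhausted fully, as required.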
-
peek(start, tree)
This is called from build_new_tree() with a sneak preview tree.
This can be used to get a small tree before the new tree is built completely, which is useful to update e.g. highlighting of a small portion of a document that is edited by a user, instead of waiting for the whole tree to update (which may cause slow highlighting updates).
When build_new_tree (the build stage) is called from a background thread, this method will also be called from that same thread.
Enable the peek() feature by setting the peek_threshold attribute to a value > 0. E.g. the value 1000 will cause the peek() method to be called with a tree that encompasses at least 1000 characters (starting with the start position).
The tree that is given is a copy of the current tree. It is safe to use it in another thread, although its contents are not valid anymore when the build has finished, or when a build is restarted, causing peek() to be called a second time. (A build is restarted when there are new changes close to the position the build originally started.)
The default implementation of this method emits the peek event, see connect().