The treebuilder module

This module defines classes and functions to build a tree structure by parsing a text string.

To get the tree of tokens using a particular root lexicon from a string of text, use build_tree().

A more advanced approach is to use the TreeBuilder, which can also build a tree in one go, but is additionally capable of updating an existing tree when the text changes at a particular position, e.g. while typing in a text editor. In that case, tokens before the modified region are reused (after carefully checking whether the change affects earlier regions), and tokens after the modified region are reused as well if they have the same context ancestry.

TreeBuilder also reports the start and end position of the updated region, and the lexicons that were left open at the end, which in some languages can mean that a document or a certain structure is incomplete.

The TreeBuilder is designed so that it is possible to perform tokenizing in a background thread, and even interrupt tokenizing when changes are to be applied while processing previous changes.

The BackgroundTreeBuilder provides an implementation using Python threads.

build_tree(root_lexicon, text, pos=0)

Build and return a tree in one go.

class TreeBuilder(root_lexicon=None)

Bases: parce.util.Observable

Build a tree from parsing the text.

The root node of the tree is in the root instance attribute. This root context is never replaced, although its lexicon may change and of course its children.

Call rebuild() to build or rebuild the tree. This method stores the desired changes to the tree and calls start_processing(), which can be re-implemented to support asynchronous tree building.

The actual building of a tree happens in build_new_tree() which builds a (replacement) tree without making any changes yet to the current tree.

The result of build_new_tree() is a tuple of values that is used when calling replace_tree(), which integrates the updated subtree in the main tree structure. This method sets three instance attributes:

start, end:

indicate the region in which the tokens were changed. After build(), start is always 0 and end = len(text), but after rebuild(), these values indicate the range that was actually re-tokenized.

lexicons:

the list of open lexicons (excluding the root lexicon) at the end of the document. This way you can see in which lexicon parsing ended.

If a tree was rebuilt, and old tail tokens were reused, the lexicons variable is not set, meaning that the old value is still valid. If the TreeBuilder was not used before, lexicons is an empty tuple.

No other variables or state are kept, so if you don’t need the above information anymore, you can throw away the TreeBuilder after use.

During the building process, the TreeBuilder emits certain events you can subscribe to, using the connect() method provided by the Observable class that’s mixed into this TreeBuilder class.

The following events are emitted, with the following arguments:

"started":

emitted when a (re)build starts; the handler is called without arguments

"replace":

emitted just before the tree actually changes (while the new tree is being built, the tree is still unchanged and accessible, but between the "replace" and "finished" events the tree is in an inconsistent state)

"finished":

emitted when a (re)build has finished; the handler is called without arguments

"updated":

emitted when a (re)build has finished; the handler is called with two arguments: start, end, that denote the updated range

"peek":

emitted by the default implementation of the peek() method, the handler is called with two arguments: start, tree

"invalidate":

emitted by the default implementation of the invalidate_context() method, the handler is called with the Context that needs to be invalidated

For example, to get notified when a build process starts:

>>> b = TreeBuilder(MyLang.root)
>>> def hi_there():
...     print("started")
...
>>> b.connect("started", hi_there)
>>> b.rebuild("some boring text")
started
>>>
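
The handler signatures for the other events follow the same pattern. As a rough sketch, the snippet below uses a small stand-in class (ObservableSketch, with a hypothetical emit() helper; neither is part of parce) to show that an "updated" handler takes two arguments while a "finished" handler takes none:

```python
class ObservableSketch:
    """Stand-in with a connect() method; this is not parce's Observable class."""

    def __init__(self):
        self._handlers = {}

    def connect(self, event, func):
        # Register func to be called whenever `event` is emitted.
        self._handlers.setdefault(event, []).append(func)

    def emit(self, event, *args):
        # Hypothetical helper: call every handler registered for `event`.
        for func in self._handlers.get(event, []):
            func(*args)


calls = []
builder = ObservableSketch()
# "updated" handlers receive the updated range; "finished" handlers get no arguments.
builder.connect("updated", lambda start, end: calls.append(("updated", start, end)))
builder.connect("finished", lambda: calls.append(("finished",)))

# Simulate the events described above for a finished (re)build.
builder.emit("updated", 0, 16)
builder.emit("finished")
print(calls)
```
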
start = 0
end = 0
lexicons = ()
peek_threshold = 0

set to a value > 0 to get peek() called during building

tree(text)

Convenience method to build a tree and return the root node.

rebuild(text, root_lexicon=False, start=0, removed=0, added=None)

Tokenize the modified part of the text again and update the tree.

The arguments:

text

The text to parse. Always give the entire text, also when you only actually changed a small part. The tree builder needs to check text before and after the changed region, and possibly re-parse more text.

root_lexicon

The root lexicon to use (default: False). False means no change; can be None or any Lexicon. If not False, the tree is always rebuilt completely.

start

Position of the change (default: 0)

removed

The number of removed characters (default: 0)

added

The number of added characters (default: None, which means all characters from start to the end of the text)

Calls build_new_tree() and replace_tree() to do the actual work.
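
An editor usually knows start, removed and added directly from its change notification. When only the old and the new text are available, the three values can be derived by comparing both strings. The helper below is a minimal sketch of that idea; diff_region is a hypothetical name and not part of this module:

```python
def diff_region(old, new):
    """Return (start, removed, added) describing how `old` became `new`."""
    limit = min(len(old), len(new))
    # Length of the common prefix: this is where the change starts.
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1
    # Length of the common suffix, not overlapping the prefix.
    tail = 0
    while tail < limit - start and old[-1 - tail] == new[-1 - tail]:
        tail += 1
    removed = len(old) - start - tail
    added = len(new) - start - tail
    return start, removed, added


old_text = "int x = 1;"
new_text = "int xy = 1;"
start, removed, added = diff_region(old_text, new_text)
print(start, removed, added)
```

Following the signature above, these values could then be passed on as rebuild(new_text, False, start, removed, added), always together with the entire new text.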

build_new_tree(text, root_lexicon, start, removed, added)

Build a new tree without yet modifying the current tree.

Tokens from the current tree are reused as much as possible. For tokens at the tail (after the end of the modified region), the pos attribute is updated if necessary.

The arguments:

text

The text to parse. Always the entire text, also when only a small portion was changed.

root_lexicon

The root lexicon to use. False means no change; can be None or any Lexicon. If not False, the tree is always rebuilt completely.

start

Position of the change.

removed

The number of removed characters.

added

The number of added characters.

Returns a Result five-tuple with tree, start, end, offset and lexicons values. The start and end are the insert positions in the old tree.

The new tree is intended to replace a part of, or the whole, old tree. If start == 0 and lexicons is not None, the whole tree can be replaced. (In this case, the root lexicon might have changed!) Use replace_tree() to insert the result tree in the old tree.

If start > 0, tokens in the old tree before start are to be preserved.

If lexicons is None, old tail tokens after end must be reused, and the old list of open lexicons is still relevant. The offset then gives the position change for the tokens that are reused.
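
The three rules above can be summarized in a few lines of code. The sketch below uses a stand-in namedtuple with the five field names mentioned in the text; the describe() function is hypothetical and only illustrates how the values are meant to be read:

```python
from collections import namedtuple

# Stand-in for the Result five-tuple; only the field names come from the text above.
Result = namedtuple("Result", "tree start end offset lexicons")


def describe(result):
    """Return a list of notes on how `result` should be merged into the old tree."""
    notes = []
    if result.start == 0 and result.lexicons is not None:
        notes.append("whole tree can be replaced")
    if result.start > 0:
        notes.append("keep old tokens before %d" % result.start)
    if result.lexicons is None:
        # Old tail tokens are reused; offset tells how far their positions move.
        notes.append("reuse old tail tokens after %d, shifted by %d"
                     % (result.end, result.offset))
    return notes


print(describe(Result(None, 0, 20, 0, ())))
print(describe(Result(None, 5, 9, 1, None)))
```
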

replace_tree(result)

Modify the tree using the result from build_new_tree().

In most types of GUI applications, this method should be called in the main (GUI) thread.

The changes are delegated to the various replace_ methods, which can be reimplemented to get fine-grained monitoring of and control over the tree-replacing process.

Additionally, this method calls invalidate_context() with the youngest Context that had children removed or added.

replace_nodes(context, slice_, nodes)

Replace the context’s slice with new nodes.

This method is called by replace_tree(). You can reimplement this method to notify others of the change.

replace_root_lexicon(lexicon)

Set the root lexicon.

This method is called by replace_tree(). You can reimplement this method to notify others of the change.

replace_pos(context, index, offset)

Adjust the pos attribute of all tokens in context[index:].

This method is called by replace_tree(). You can reimplement this method to notify others of the change.
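
To illustrate what adjusting positions means, the sketch below shifts the pos of every token from a given index onward. It is a simplification under stated assumptions: TokenSketch and shift_tail are hypothetical names, and a flat list stands in for the context, so nested child contexts are not handled here.

```python
class TokenSketch:
    """Stand-in token with only the attributes needed for this illustration."""

    def __init__(self, pos, text):
        self.pos = pos
        self.text = text


def shift_tail(nodes, index, offset):
    # Move every token in nodes[index:] by `offset` characters.
    for token in nodes[index:]:
        token.pos += offset


# "int x =" becomes "int xy =": the token after the insertion moves one position.
tokens = [TokenSketch(0, "int"), TokenSketch(4, "x"), TokenSketch(6, "=")]
shift_tail(tokens, 2, 1)
positions = [t.pos for t in tokens]
print(positions)
```
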

invalidate_context(context)

Called with the youngest Context that had children removed or added.

This means that the meaning of this context has probably changed, and that its ancestors also need to be invalidated. This is relevant, for example, when you transform the context to some other data structure.

The default implementation of this method emits the invalidate event, see connect().

get_root(wait=False, callback=None, args=(), kwargs={})

Return the root element of the completed tree.

This is simply the root instance attribute, but this method only returns the tree when the busy attribute is False.

If wait is True, this call blocks until tokenizing is done, and the full tree is returned. If wait is False, None is returned if the tree is still busy being built.

If a callback is given and tokenizing is still busy, that callback is called once when tokenizing is ready. If given, args and kwargs are the arguments the callback is called with, defaulting to () and {}, respectively.

Note that, for the lifetime of a TreeBuilder, the root element is always the same. The root element is also accessible in the root attribute. But using this method you can be sure that you are dealing with a complete and fully intact tree.

get_changes()

Get and combine the stored change requests in a Changes object.

This may only be called from the same thread that also performs the rebuild().

start_processing()

Called when there are recorded changes to process.

The default implementation reads all build stages from the process() generator until it is exhausted. You can reimplement this method to call it e.g. in a background thread.

process()

Process all changes and update the tree.

This method behaves as a generator coroutine. Instead of simply calling this method, you should iterate over its output, which reports which stage the process is at.

Yields "build" when about to build a new tree and "replace" when about to replace the tree with the new one (these stages can be repeated), "finish" when finished looping, and "done" at the very end.

When re-implementing start_processing(), you can choose to decide which stages are to be run in a background thread and which in a main (GUI) thread.

You should exhaust the generator fully.
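
The iteration pattern can be sketched as follows. The generator here is a stand-in that only yields the stage names listed above (fake_process and drive are hypothetical names); the point is that the generator pauses before doing the work for each stage, which gives the caller a chance to decide where the next step runs.

```python
def fake_process():
    """Stand-in generator that yields the stage names described above."""
    for _ in range(2):      # two pending changes, so build/replace repeats
        yield "build"       # paused just before building a new tree
        yield "replace"     # paused just before replacing the tree
    yield "finish"
    yield "done"


def drive(stages):
    """Exhaust the generator fully, recording each reported stage."""
    seen = []
    for stage in stages:
        # A real driver could switch threads here before resuming the generator.
        seen.append(stage)
    return seen


stages_seen = drive(fake_process())
print(stages_seen)
```
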

wait()

Implement to wait for completion if a background job is running.

The default implementation does nothing, and immediately returns.

peek(start, tree)

This is called from build_new_tree() with a sneak preview tree.

This can be used to get a small tree before the new tree is built completely, which is useful to update e.g. highlighting of a small portion of a document that is edited by a user, instead of waiting on the whole tree to update (which may cause slow highlighting updates).

When build_new_tree() (the build stage) is called from a background thread, this method will also be called from that same thread.

Enable the peek() feature by setting the peek_threshold attribute to a value > 0. E.g. the value 1000 will cause the peek() method to be called with a tree that encompasses at least 1000 characters (starting with the start position).

The tree that is given is a copy of the current tree. It is safe to use it in another thread, although its contents are no longer valid when the build has finished, or when a build is restarted, causing peek() to be called a second time. (A build is restarted when there are new changes close to the position where the build originally started.)

The default implementation of this method emits the peek event, see connect().

lock(acquire)

Acquire lock (True) or release lock (False). Does nothing by default.

If you want to run the full update and replace jobs in a background thread, you may need locking, to prevent changes from going unnoticed.
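
A reimplementation could wrap a standard threading.Lock, for instance. The class below is a minimal sketch of the lock(acquire) calling convention only; LockingSketch and its is_locked() helper are hypothetical and do not show how BackgroundTreeBuilder is actually implemented.

```python
import threading


class LockingSketch:
    """Hypothetical stand-in showing only the lock(acquire) convention."""

    def __init__(self):
        self._lock = threading.Lock()

    def lock(self, acquire):
        # True acquires the lock (blocking if needed), False releases it.
        if acquire:
            self._lock.acquire()
        else:
            self._lock.release()

    def is_locked(self):
        return self._lock.locked()


sketch = LockingSketch()
sketch.lock(True)
held = sketch.is_locked()       # the lock is held while changes are recorded
sketch.lock(False)
released = not sketch.is_locked()
print(held, released)
```
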

process_started()

Called at the start of the tree building process.

The default implementation of this method emits the started event, see connect().

process_finished()

Called when tree building is done.

The default implementation of this method emits the updated(start, end) and finished events, see connect().

class BackgroundTreeBuilder(root_lexicon=None)

Bases: parce.treebuilder.TreeBuilder

A TreeBuilder that can tokenize a text in a Python thread.

In BackgroundTreeBuilder, rebuild() returns immediately, because start_processing() has been reimplemented to call itself in a background thread.

You can continue adding changes while previous changes are processed; the tree builder will immediately adapt to the new changes.

To be sure you get a completed tree, call get_root(True).

lock(acquire)

Reimplemented to actually lock/unlock.

start_processing()

Reimplemented to call start_processing in a background thread.

wait()

Reimplemented to await our background thread if active.

process_finished()

Reimplemented to clear the job attribute.