The treedocument module

A Document mixin that keeps all text tokenized.

When the text is modified, retokenizes only the modified part.

class TreeDocumentMixin(builder)

Bases: object

Encapsulates a full tokenized text string.

Combine this class with a subclass of AbstractDocument (see the document module).

Every time the text is modified, only the modified part is retokenized. If that changes the lexicon in which the following part (the text after the modified part) starts, that part is also retokenized, until the state (the list of active lexicons) matches the state of the existing tokens.
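The stop condition described above can be illustrated with a toy model. The sketch below is not the module's implementation: it works per line rather than per token, knows only one nested lexicon (a `/* ... */` comment), and the names `scan_line`, `tokenize_all` and `retokenize` are hypothetical. It only shows how rescanning can stop as soon as the computed state equals the stored one.

```python
def scan_line(line, state):
    """Return the state after scanning `line`, starting in `state`.

    The state is a tuple of open lexicon names; this toy only knows "comment".
    """
    i = 0
    while i < len(line):
        if state and line.startswith("*/", i):
            state = ()
            i += 2
        elif not state and line.startswith("/*", i):
            state = ("comment",)
            i += 2
        else:
            i += 1
    return state


def tokenize_all(lines):
    """Return a list holding the start state of every line plus the final state."""
    states = [()]
    for line in lines:
        states.append(scan_line(line, states[-1]))
    return states


def retokenize(lines, states, index):
    """Rescan from line `index` and stop once the state matches the stored one.

    Updates `states` in place and returns the number of lines rescanned.
    """
    count = 0
    state = states[index]
    for i in range(index, len(lines)):
        state = scan_line(lines[i], state)
        count += 1
        if states[i + 1] == state:
            break  # the stored states after this point are still valid
        states[i + 1] = state
    return count


lines = ["a = 1", "b = 2", "c = 3", "d = 4"]
states = tokenize_all(lines)

lines[1] = "b = 5"                   # harmless edit: the state is unchanged
print(retokenize(lines, states, 1))  # 1

lines[1] = "b /* 5"                  # opens a comment: the rest must be rescanned
print(retokenize(lines, states, 1))  # 3
```

In the first edit only the modified line is rescanned; in the second, the opened comment changes the start state of every following line, so rescanning continues to the end of the text.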

builder()

Return the TreeBuilder we were instantiated with.

get_root(wait=False, callback=None, args=(), kwargs={})

Get the root element of the completed tree.

If wait is True, this call blocks until tokenizing is done and then returns the full tree. If wait is False, None is returned while the tree is still being built.

If a callback is given, a BackgroundTreeBuilder is in use, and tokenizing is still in progress, the callback is called once when tokenizing finishes. The callback is called with args and kwargs, which default to () and {}, respectively.

Note that the root element remains the same object for the lifetime of a Document. Using this method, however, ensures that you are dealing with a complete and fully intact tree.
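The three calling styles (non-blocking, blocking, callback) can be tried out with a small stand-in. The class below is hypothetical and does not use the library at all: `StandInDocument` and its `finish()` method are assumptions made for illustration, and only the `get_root()` signature and its documented return values are taken from the description above.

```python
import threading


class StandInDocument:
    """Hypothetical stand-in that mimics the documented get_root() contract."""

    def __init__(self):
        self._root = ["root"]            # the same object for the whole lifetime
        self._done = threading.Event()   # set once "tokenizing" has finished
        self._callbacks = []
        self._lock = threading.Lock()

    def finish(self):
        """Simulate a background builder finishing its work."""
        with self._lock:
            self._done.set()
            callbacks, self._callbacks = self._callbacks, []
        for callback, args, kwargs in callbacks:
            callback(*args, **kwargs)    # each callback is called exactly once

    def get_root(self, wait=False, callback=None, args=(), kwargs=None):
        with self._lock:
            if self._done.is_set():
                return self._root
            if callback is not None:
                self._callbacks.append((callback, args, kwargs or {}))
        if wait:
            self._done.wait()            # block until the tree is complete
            return self._root
        return None                      # still busy


doc = StandInDocument()
messages = []

# Non-blocking call while busy: returns None and registers the callback.
root = doc.get_root(callback=messages.append, args=("tree is ready",))

doc.finish()                             # the callback runs now
root = doc.get_root()                    # the complete tree is returned
```

A callback is useful when blocking is not an option, for example in a GUI event loop; `wait=True` is simpler when the caller can afford to block.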

root_lexicon()

Return the currently set root lexicon.

set_root_lexicon(root_lexicon)

Set the root lexicon to use to tokenize the text.

open_lexicons()

Return the list of lexicons that were left open at the end of the text.

The root lexicon is not included; if parsing ended in the root lexicon, this list is empty, and the text can be considered “complete.”
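To make the notion of "open" lexicons concrete, here is a toy scanner. It is only a sketch under assumed rules: the two lexicon names ("list" for parentheses, "string" for double quotes) and the function itself are invented for illustration and are unrelated to the real lexicons a root lexicon would define.

```python
def open_lexicons(text):
    """Return the names of the toy lexicons still open at the end of `text`."""
    stack = []  # the root lexicon is implicit and never stored here
    for ch in text:
        if stack and stack[-1] == "string":
            if ch == '"':
                stack.pop()              # the closing quote ends the string
        elif ch == '"':
            stack.append("string")
        elif ch == "(":
            stack.append("list")
        elif ch == ")" and stack and stack[-1] == "list":
            stack.pop()
    return stack


print(open_lexicons('(a "b" c)'))   # []
print(open_lexicons('(a "b'))       # ['list', 'string']
```

An empty result means scanning ended in the root lexicon, which corresponds to the "complete" case described above; a non-empty result lists what would still have to be closed, outermost first.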

modified_range()

Return a two-tuple (start, end) describing the range that was retokenized.

contents_changed(start, removed, added)

Called after the text has been modified; retokenizes the modified part.
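The meaning of the three arguments is not spelled out here. Assuming that start is the position of the change and that removed and added are character counts (an assumption, not confirmed by this page), the following sketch shows how such a triple could be derived from the text before and after an edit. The helper name `describe_change` is hypothetical.

```python
def describe_change(old, new):
    """Return (start, removed, added) for the smallest edit turning old into new.

    Assumption: `removed` and `added` are lengths, not strings.
    """
    limit = min(len(old), len(new))
    # Length of the common prefix.
    start = 0
    while start < limit and old[start] == new[start]:
        start += 1
    # Length of the common suffix, not overlapping the prefix.
    tail = 0
    while tail < limit - start and old[-1 - tail] == new[-1 - tail]:
        tail += 1
    return start, len(old) - start - tail, len(new) - start - tail


print(describe_change("hello world", "hello brave world"))  # (6, 0, 6)
print(describe_change("abcd", "abd"))                       # (2, 1, 0)
```

Under the same assumption, the retokenized range reported afterwards would be expected to cover at least the positions start through start + added of the new text, and possibly more when the following text had to be retokenized as well.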