The Document class

parce provides a Document which can keep the text, collect changes to the text and internally call the TreeBuilder to process the changes. The Document simply behaves as a mutable string with some extra features.

To instantiate a Document:

>>> import parce
>>> from parce.lang.xml import Xml   # just for example
>>> d = parce.Document()
>>> d.set_root_lexicon(Xml.root)
>>> d.set_text(r'<xml attr="value">')

You can also give root lexicon and text on instantiation:

>>> d = parce.Document(Xml.root, r'<xml attr="value">')

Or load() a document from the filesystem:

>>> d = parce.Document.load("file.xml")

This method can autodetect the language. To get the tree:

>>> tree = d.get_root(True)
>>> tree.dump()
<Context Xml.root at 0-18 (3 children)>
 ├╴<Token '<' at 0:1 (Delimiter)>
 ├╴<Token 'xml' at 1:4 (Name.Tag)>
 ╰╴<Context Xml.attrs at 5-18 (5 children)>
    ├╴<Token 'attr' at 5:9 (Name.Attribute)>
    ├╴<Token '=' at 9:10 (Delimiter.Operator)>
    ├╴<Token '"' at 10:11 (Literal.String)>
    ├╴<Context Xml.dqstring at 11-17 (2 children)>
    │  ├╴<Token 'value' at 11:16 (Literal.String)>
    │  ╰╴<Token '"' at 16:17 (Literal.String)>
    ╰╴<Token '>' at 17:18 (Delimiter)>

Accessing and modifying text

All text is available through the text() method, and, just as with a Python string, a fragment of the Document’s text can be read using the [ ] syntax:

>>> d.text()
'<xml attr="value">'
>>> d[5:17]
'attr="value"'

But you can also modify the text, using the slice syntax:

>>> d[11:16]="Something Completely Else!!"
>>> d.text()
'<xml attr="Something Completely Else!!">'
>>> d.get_root(True).dump()
<Context Xml.root at 0-40 (3 children)>
 ├╴<Token '<' at 0:1 (Delimiter)>
 ├╴<Token 'xml' at 1:4 (Name.Tag)>
 ╰╴<Context Xml.attrs at 5-40 (5 children)>
    ├╴<Token 'attr' at 5:9 (Name.Attribute)>
    ├╴<Token '=' at 9:10 (Delimiter.Operator)>
    ├╴<Token '"' at 10:11 (Literal.String)>
    ├╴<Context Xml.dqstring at 11-39 (2 children)>
    │  ├╴<Token 'Something Completely Else!!' at 11:38 (Literal.String)>
    │  ╰╴<Token '"' at 38:39 (Literal.String)>
    ╰╴<Token '>' at 39:40 (Delimiter)>

Note that we requested the tree again (and waited for it to be tokenized) using get_root(True), but the returned tree is always the same object for the lifetime of the Document (or, to be more precise, of the TreeBuilder the document internally uses).
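
Continuing the session above, we can verify this:

>>> tree is d.get_root(True)
True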

Using Document.modified_range() we can see which part of the text was retokenized as a result of the last change:

>>> d.modified_range()
(11, 38)

This information is provided by the TreeBuilder. Using Document.open_lexicons() we can get the list of lexicons that the TreeBuilder found still open at the end of the document:

>>> d.open_lexicons()
[Xml.tag]

In this case, because the xml tag was not closed, an Xml.tag context was left open. We can change that. Using Document.insert() we add one character:

>>> d.insert(39, '/')
>>> d.open_lexicons()
[]
>>> d.get_root().last_token()
<Token '/>' at 39:41 (Delimiter)>
>>> d.modified_range()
(39, 41)

Instead of insert(), we could also have written d[39:39]='/'.

Performing multiple edits at once

When you want to perform multiple edits in one go, start a with context and apply all desired changes. The document does not change during these edits, so all ranges remain valid during the process.

Only when the with block is exited, the changes are applied and the tree of tokens is updated:

>>> from parce.action import Name
>>> with d:
...     for token in d.get_root().query.all.action(Name.Tag):
...         d[token.pos:token.end] = "yo:" + token.text.upper()
...
>>> d.text()
'<yo:XML attr="Something Completely Else!!"/>'

This incantation replaces every XML tag name with its upper-case form, with "yo:" prepended.

When editing a document in a with context, it is an error if your changes overlap, because it is then not clear what the text should look like after applying the changes. For example:

>>> d = parce.Document(Xml.root, r'<xml attr="value">')
>>> with d:
...     d[1:4] = 'XML'
...     d[5:9] = 'attribute'
...     d[6:16] = 'blabla'
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  (...)
RuntimeError: overlapping changes: 6 before 9; text='blabla'

When inserting multiple pieces of text at the same position, the order in which you specify the changes is preserved:

>>> d = parce.Document(Xml.root, r'<xml attr="value">')
>>> with d:
...     d[16:16] = ' value1'
...     d[16:16] = ' value2'
...     d[16:16] = ' value3'
...
>>> d.text()
'<xml attr="value value1 value2 value3">'

Cursor and Block

Related to Document are Cursor and Block.

A Cursor simply describes a position (pos) in the document, or a selected range (from pos to end). If you write routines that inspect the tokens and then change the text in some way, you can have them take a Cursor as argument, so they get the cursor’s Document, the selected range and the tokenized tree in one go.

A cursor keeps its position updated as the Document changes, as long as you keep a reference to it.
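
As a small sketch (assuming Cursor can be imported from parce and constructed as Cursor(document, pos, end)), inserting text before a cursor shifts its range accordingly:

>>> import parce
>>> from parce.lang.xml import Xml
>>> d = parce.Document(Xml.root, r'<xml attr="value">')
>>> c = parce.Cursor(d, 5, 9)     # select the attribute name "attr"
>>> d[0:0] = '   '                # insert three characters at the start
>>> c.pos, c.end
(8, 12)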

A Block describes a line of text and is instantiated using Document.find_block(), Document.blocks(), Cursor.block() or Cursor.blocks(), and then knows its pos and end in the Document. You can easily iterate over lines of text using the blocks() methods.

Getting at the tokens

Of course, you can get to the tokens by examining the tree, but there are a few convenience methods. Document.token(pos) returns the token closest to the specified position (and on the same line), and Cursor.token() does the same. Cursor.tokens() yields the tokens in the selected range, if any.
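
For example (a quick sketch, reusing the XML document from the start of this page; we wait for the tree first):

>>> import parce
>>> from parce.lang.xml import Xml
>>> d = parce.Document(Xml.root, r'<xml attr="value">')
>>> tree = d.get_root(True)       # wait until the text is tokenized
>>> d.token(6)
<Token 'attr' at 5:9 (Name.Attribute)>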

Block.tokens() returns a tuple of the tokens at that line:

>>> from parce import Document
>>> from parce.lang.css import Css
>>> d = Document(Css.root, open('parce/themes/default.css').read())
>>> b = d.find_block(200)
>>> b.tokens()
(<Token 'background' at 203:213 (Name.Property.Definition)>, <Token ':' at 213:214 (Delimiter)>,
<Token 'ivory' at 215:220 (Literal.Color)>, <Token ';' at 220:221 (Delimiter)>)

Maintaining a transformation

Behind the scenes of a Document, a Worker is responsible for updating the tokenized tree (i.e. running the tree builder); the same worker can also keep the transformed result of that tree up to date.

To enable this, all that’s needed is to add a Transformer to the document’s Worker. You can specify a Transformer (and/or a Worker) on Document construction. Here is an example:

>>> from parce.lang.json import Json
>>> from parce import Document
>>> from parce.transform import Transformer
>>> d = Document(Json.root, transformer=Transformer())
>>> d.set_text('{"key": [1, 2, 3, 4, 5]}')
>>> d.get_transform(True)
{'key': [1, 2, 3, 4, 5]}
>>> d.insert(22, ", 6, 7, 8")
>>> d.get_transform(True)
{'key': [1, 2, 3, 4, 5, 6, 7, 8]}

Note that after inserting some text, the transformed result is automatically updated. If the default transformer is all you need, constructing a document is even simpler:

>>> import parce
>>> d = parce.Document(parce.find('json'), '{"key": [1, 2, 3]}', transformer=True)
>>> d.get_transform(True)
{'key': [1, 2, 3]}

More goodies

The parce.Document class is in fact built from base classes in four modules: AbstractMutableString/MutableString from the mutablestring module, AbstractDocument/Document from the document module, DocumentIOMixin from the docio module and WorkerDocumentMixin from the work module.

Using parce.DocumentInterface (which bundles all those base classes), it is not difficult to design a class that wraps an object representing a text document in a GUI editor. You only need to provide two methods in your wrapper: text() to get all text, and _update_text() to change the text programmatically. When the text is changed, AbstractDocument calls text_changed(), which in WorkerDocumentMixin is implemented to inform the TreeBuilder about the part of the text that needs to be retokenized. Your wrapper class should also call text_changed() whenever the user types in the editor.
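
As an illustration only, here is a rough sketch of such a wrapper. The base class constructor, the exact contract of _update_text() and the GuiBuffer widget are all assumptions here; check the document module’s documentation for the real interface.

import parce

class EditorDocument(parce.DocumentInterface):
    """Sketch: wrap a hypothetical GuiBuffer text widget."""
    def __init__(self, buffer):
        super().__init__()          # assumption: the base class needs no arguments
        self._buffer = buffer

    def text(self):
        # return the full text of the wrapped widget (hypothetical method)
        return self._buffer.full_text()

    def _update_text(self, changes):
        # assumption: changes is an iterable of (start, end, text) tuples
        for start, end, text in changes:
            self._buffer.replace(start, end, text)   # hypothetical method

# Whenever the user edits the text in the GUI, the wrapper should also call
# text_changed(), as described above.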

Because a Document is basically a mutable string, it also offers convenience methods for searching, replacing and regular-expression substitution, and even undo/redo. See the document module’s documentation.
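
For instance, a small sketch of undoing an edit (the undo() method name is assumed from the undo/redo support mentioned above):

>>> import parce
>>> from parce.lang.xml import Xml
>>> d = parce.Document(Xml.root, r'<xml attr="value">')
>>> d[11:16] = 'other'
>>> d.text()
'<xml attr="other">'
>>> d.undo()
>>> d.text()
'<xml attr="value">'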