Transforming¶
The transform module provides infrastructure to transform a tree structure or a text into any data structure you wish to create.
The basic idea of transformation is simple: for every Context in a tree structure, a method of a Transform instance is called. The method has the same name as the context’s lexicon, and is called with an Items instance containing the list of children of that context.
Sub-contexts in that list have already been replaced with the result of that context’s lexicon’s transformation method, wrapped in an Item, so the Items list consists of instances of either Token or Item. To make it easier to distinguish between the two, the Item class has an is_token class attribute, set to False.
Thus, a Transform class can closely mimic a corresponding Language class. If you want to ignore the output of a particular lexicon, don’t define a method with that name, but set its name to None in the Transform class definition.
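For example, a minimal sketch of such a Transform class, for a purely hypothetical Language with value and comment lexicons:

from parce.transform import Transform


class SomeTransform(Transform):
    """A minimal sketch for a hypothetical Language with value and comment lexicons."""

    def value(self, items):
        result = []
        for i in items:
            if i.is_token:
                result.append(i.text)   # a Token: use its text, action or pos
            else:
                result.append(i.obj)    # an Item: the transformed sub-context
        return result

    comment = None   # ignore the output of the comment lexicon entirely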
How it works¶
The actual task of transformation (evaluation) is performed by a Transformer. The Transformer has infrastructure to choose the Transform class based on the current Language. Using the add_transform() method, you can assign a Transform instance to a Language class.
There are two convenience functions, transform_text() and transform_tree(). For example:
from parce import root, Language, lexicon, default_action
from parce.action import Delimiter, Number, String
from parce.transform import Transform, transform_text
class MyLang(Language):
    @lexicon
    def root(cls):
        yield r'\[', Delimiter, cls.list
        yield r'\d+', Number
        yield r'"', String, cls.string

    @lexicon
    def list(cls):
        yield r'\]', Delimiter, -1
        yield from cls.root

    @lexicon
    def string(cls):
        yield r'"', String, -1
        yield default_action, String
This language definition finds numbers, strings, and lists of those. We want to convert those to their Python equivalents. So, we create a corresponding Transform class, with methods having the same name as the lexicons in the Language definition:
class MyLangTransform(Transform):
    def root(self, items):
        result = []
        for i in items:
            if i.is_token:
                if i.action is Number:
                    result.append(int(i.text))  # a Number
            else:
                result.append(i.obj)            # a list or string
        return result

    def list(self, items):
        return self.root(items)

    def string(self, items):
        return items[0].text    # not the closing quote
Now let’s test our Transform!
>>> transform_text(MyLang.root, '1 2 3 [4 "Q" 6] x 7 8 9')
[1, 2, 3, [4, 'Q', 6], 7, 8, 9]
It works! Note that the stray x is ignored, because it is not matched by any rule. The above function call is equivalent to:
>>> from parce.transform import Transformer
>>> t = Transformer()
>>> t.add_transform(MyLang, MyLangTransform())
>>> t.transform_text(MyLang.root, '1 2 3 [4 "Q" 6] x 7 8 9')
[1, 2, 3, [4, 'Q', 6], 7, 8, 9]
Transforming a tree structure¶
Using the same Transform class, you can also transform a tree structure:
>>> from parce.transform import transform_tree
>>> tree = root(MyLang.root, '1 2 3 [4 "Q" 6] x 7 8 9')
>>> tree.dump()
<Context MyLang.root at 0-23 (8 children)>
├╴<Token '1' at 0:1 (Literal.Number)>
├╴<Token '2' at 2:3 (Literal.Number)>
├╴<Token '3' at 4:5 (Literal.Number)>
├╴<Token '[' at 6:7 (Delimiter)>
├╴<Context MyLang.list at 7-15 (5 children)>
│ ├╴<Token '4' at 7:8 (Literal.Number)>
│ ├╴<Token '"' at 9:10 (Literal.String)>
│ ├╴<Context MyLang.string at 10-12 (2 children)>
│ │ ├╴<Token 'Q' at 10:11 (Literal.String)>
│ │ ╰╴<Token '"' at 11:12 (Literal.String)>
│ ├╴<Token '6' at 13:14 (Literal.Number)>
│ ╰╴<Token ']' at 14:15 (Delimiter)>
├╴<Token '7' at 18:19 (Literal.Number)>
├╴<Token '8' at 20:21 (Literal.Number)>
╰╴<Token '9' at 22:23 (Literal.Number)>
>>> transform_tree(tree)
[1, 2, 3, [4, 'Q', 6], 7, 8, 9]
Note
Note that transform_tree() gets the root lexicon from the root element, and then automatically finds the corresponding Transform class, if you didn’t specify one yourself. This is done by looking in the same module as the root lexicon’s language, and finding there a Transform subclass with the same name with "Transform" appended (see Transformer.find_transform()).
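The MyLang example above already follows this convention. As a rough sketch, a hypothetical module mylang.py could define both classes side by side, so the Transform is found without an explicit add_transform() call:

# mylang.py -- hypothetical module layout, the names are assumptions
from parce import Language
from parce.transform import Transform


class MyLang(Language):
    ...   # lexicons as defined earlier


class MyLangTransform(Transform):
    ...   # transform methods as defined earlier


# Because MyLangTransform lives in the same module as MyLang and is named
# MyLang + "Transform", transform_text() and transform_tree() find it
# automatically.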
Examples of Transform classes can be found in the css, csv and json modules.
Calculator example¶
As a proof of concept, below is a simplistic calculator; it can be found in tests/calc.py:
# Calculator parce example
from parce import Language, lexicon, default_target, skip
from parce.action import Number, Operator, Delimiter
from parce.transform import Transform
skip_whitespace = (r'\s+', skip)
class Calculator(Language):
    @lexicon
    def root(cls):
        yield r'\d+', Number
        yield r'\-', Operator, cls.subtract
        yield r'\+', Operator, cls.add
        yield r'\*', Operator, cls.multiply
        yield r'/', Operator, cls.divide
        yield r'\(', Delimiter, cls.parens

    @lexicon
    def parens(cls):
        yield r'\)', Delimiter, -1
        yield from cls.root

    @lexicon
    def subtract(cls):
        yield r'\d+', Number
        yield r'\*', Operator, cls.multiply
        yield r'/', Operator, cls.divide
        yield r'\(', Delimiter, cls.parens
        yield skip_whitespace
        yield default_target, -1

    @lexicon
    def add(cls):
        yield from cls.subtract

    @lexicon
    def multiply(cls):
        yield r'\d+', Number
        yield r'\(', Delimiter, cls.parens
        yield skip_whitespace
        yield default_target, -1

    @lexicon
    def divide(cls):
        yield from cls.multiply
class CalculatorTransform(Transform):
    def root(self, items):
        result = 0
        for i in items:
            if i.is_token:
                if i.action is Number:
                    result = int(i.text)
            elif i.name == "add":
                result += i.obj
            elif i.name == "subtract":
                result -= i.obj
            elif i.name == "multiply":
                result *= i.obj
            elif i.name == "divide":
                result /= i.obj
            else:  # i.name == "parens"
                result = i.obj
        return result

    parens = add = subtract = multiply = divide = root
Test it with:
>>> from parce.transform import transform_text
>>> from tests.calc import Calculator # (from source directory)
>>> transform_text(Calculator.root, " 1 + 1 ")
2
>>> transform_text(Calculator.root, " 1 + 2 * 3 ")
7
>>> transform_text(Calculator.root, " 1 * 2 + 3 ")
5
>>> transform_text(Calculator.root, " (1 + 2) * 3 ")
9
Integration with TreeBuilder¶
It is easy to keep a transformed structure up to date when a tree changes. The Transformer caches the result of every transform method using a weak reference to the Context that yielded that result. So when modifications to a text are small, in most cases the Transformer is very quick to apply the necessary changes to the transformed result.
When the TreeBuilder changes the tree, it emits the event "invalidate" with the youngest node that has its children changed (i.e. tokens or contexts were added or removed).
The Transformer then knows that that context and all its ancestors need to be recomputed, and removes them from its cache. During transformation, all newly added contexts are evaluated as well, because their transformations can’t be found in the cache.
Note
Contexts that only changed position are not recomputed. If you want your transformed structure to know the position in the text, you should store references to the corresponding tokens in your structure. The pos attribute of the Tokens that move is adjusted by the tree builder, so they still point to the right position after an update of the tree.
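For instance, here is a rough sketch of a variant of the string() method from the MyLangTransform above that keeps the Token itself instead of only its text; the class name is just an assumption for illustration:

from parce.transform import Transform


class PositionAwareTransform(Transform):

    def string(self, items):
        token = items[0]   # the Token carrying the string's text
        # Store the Token itself: its pos attribute is adjusted by the
        # tree builder, so it keeps pointing at the right place in the
        # text after an update of the tree.
        return token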
When the tree builder is about to insert the modified tree part in the original tree, it emits the "replace" event. The transformer reacts by interrupting any current job that might be busy computing the transformed result. Finally, when the tree builder emits "finished", the transformer rebuilds the transformed result, reusing as much as possible the previously cached transform results for Contexts that did not change.
A single Transformer can be used for multiple transformation jobs for multiple documents or tree builders, even at the same time. It shares the added Transform instances between multiple jobs and documents. If your Transform classes keep internal state, that might not be desirable; in that case you can use a separate Transformer for every document or tree.
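A minimal sketch of that approach, reusing the MyLang classes from above: each document simply gets its own Transformer with its own Transform instance.

from parce.transform import Transformer

# One Transformer (and one Transform instance) per document, so any
# internal state the Transform keeps is never shared between documents.
transformer_doc1 = Transformer()
transformer_doc1.add_transform(MyLang, MyLangTransform())

transformer_doc2 = Transformer()
transformer_doc2.add_transform(MyLang, MyLangTransform())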
One way to automatically run a Transformer from a TreeBuilder is using the Transformer.connect_treebuilder() method, to set up all needed connections. Here is an example:
>>> from parce.lang.json import Json
>>> from parce.treebuilder import TreeBuilder
>>> from parce.transform import Transformer
>>>
>>> b = TreeBuilder(Json.root)
>>> t = Transformer()
>>> t.connect_treebuilder(b)
>>>
>>> b.rebuild('{"key": [1, 2, 3, 4, 5]}')
>>> t.result(b.root)
{'key': [1, 2, 3, 4, 5]}
>>> b.rebuild('{"key": [1, 2, 3, 4, 5, 6, 7, 8]}', False, 22, 0, 9)
>>> t.result(b.root)
{'key': [1, 2, 3, 4, 5, 6, 7, 8]}
The call to TreeBuilder.rebuild() might seem overwhelming: we instruct it to re-parse the text, starting at position 22, with 0 characters removed and 9 added. And now the transform is automatically updated.
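To see where those numbers come from, here is a small check in plain Python, just for illustration:

old = '{"key": [1, 2, 3, 4, 5]}'
new = '{"key": [1, 2, 3, 4, 5, 6, 7, 8]}'

# Both texts are identical up to position 22; there ", 6, 7, 8"
# (9 characters) was inserted and nothing was removed.
assert old[:22] == new[:22]
assert new[22:31] == ', 6, 7, 8'
assert len(new) - len(old) == 9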
But it is much easier to use the Document feature provided by parce, because that keeps track of the text and its modifications, and can automatically keep the tokenized tree and the transformed result up to date. So head on to the next chapter!