Getting started
We describe how parce works by creating a language definition and using it. Start with:
import parce
or:
from parce import *
The first way is recommended; the latter way of importing is more convenient when defining your own language.
Defining your own language
A language is simply a class with no behaviour other than grouping lexicons. A lexicon is a set of rules describing what to look for in text. We define a simple language to get started:
import re
from parce import Language, lexicon, default_action
import parce.action as a   # use the standard actions

class Nonsense(Language):
    @lexicon
    def root(cls):
        yield r'\d+', a.Number
        yield r'\w+', a.Text
        yield r'"', a.String, cls.string
        yield r'%', a.Comment, cls.comment
        yield r'[.,:?!]', a.Delimiter

    @lexicon
    def string(cls):
        yield r'"', a.String, -1
        yield default_action, a.String

    @lexicon(re_flags=re.MULTILINE)
    def comment(cls):
        yield r'$', a.Comment, -1
        yield default_action, a.Comment
Language and lexicon are objects imported from parce. Language is the base class for all language definitions. Text, Number, String, Delimiter and Comment are so-called standard actions. Standard actions are simple named objects that identify the type of the matched text. They have no behaviour and are essentially singleton objects using virtually no memory.
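Standard actions can also be nested: in this language, Number and String are in fact sub-actions of Literal, which is why the tree dump further below displays them as Literal.Number and Literal.String. A quick check in the interpreter (the repr formatting shown here is assumed, based on the dump output below):

>>> import parce.action as a
>>> a.Number
Literal.Number
>>> a.String
Literal.String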
The lexicon decorator makes a function into a Lexicon object, which encapsulates the parsing of text using the rules supplied in the function. When parsing starts for the first time, the function is called to get the rules.

Each rule consists of two or more parts: first the pattern, then the action, and optionally one or more targets. The pattern is a regular expression string; the action may be anything, giving a meaning to the matched text. A target is either a reference to another lexicon, or a number such as 1 or -1. A target that refers to another lexicon pushes that lexicon onto the stack; a number like -1 pops the current lexicon off the stack, so that the previous lexicon takes over parsing again.
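To make this concrete, here are the two quote rules from the Nonsense language above, annotated part by part:

#     pattern   action     target
# in root: a double quote starts a string and pushes the string lexicon
yield r'"', a.String, cls.string
# in string: a closing quote pops the string lexicon, back to the previous one
yield r'"', a.String, -1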
Parsing text using our language
Now, we use this language definition to parse some text:
>>> text = '''
... Some text with 3 numbers and 1 "string inside
... over multiple lines", and 1 % comment that
... ends on a newline.
... '''
To parse text, we need to give parce the lexicon to start with. This is called the root lexicon. To parse the text and get the results, we call the root() function of parce:
>>> from parce import root
>>> tree = root(Nonsense.root, text)
The root lexicon in this case is Nonsense.root, although the name of the lexicon does not matter at all; naming the root lexicon root is simply a good convention. Let's dump() the tree to see what's inside!
>>> tree.dump()
<Context Nonsense.root at 1-108 (19 children)>
├╴<Token 'Some' at 1:5 (Text)>
├╴<Token 'text' at 6:10 (Text)>
├╴<Token 'with' at 11:15 (Text)>
├╴<Token '3' at 16:17 (Literal.Number)>
├╴<Token 'numbers' at 18:25 (Text)>
├╴<Token 'and' at 26:29 (Text)>
├╴<Token '1' at 30:31 (Literal.Number)>
├╴<Token '"' at 32:33 (Literal.String)>
├╴<Context Nonsense.string at 33-67 (2 children)>
│ ├╴<Token 'string insid...ultiple lines' at 33:66 (Literal.String)>
│ ╰╴<Token '"' at 66:67 (Literal.String)>
├╴<Token ',' at 67:68 (Delimiter)>
├╴<Token 'and' at 69:72 (Text)>
├╴<Token '1' at 73:74 (Literal.Number)>
├╴<Token '%' at 75:76 (Comment)>
├╴<Context Nonsense.comment at 76-89 (1 child)>
│ ╰╴<Token ' comment that' at 76:89 (Comment)>
├╴<Token 'ends' at 90:94 (Text)>
├╴<Token 'on' at 95:97 (Text)>
├╴<Token 'a' at 98:99 (Text)>
├╴<Token 'newline' at 100:107 (Text)>
╰╴<Token '.' at 107:108 (Delimiter)>
>>>
We see that the returned object is a Context containing Token and other Context instances. A Context is just a Python list, containing the tokens that a lexicon generated. A Token is a light-weight object knowing its text, position and the action that was specified in the rule.
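Since a Context is a list, we can index into it and look at a token directly (a small sketch; it assumes Token exposes text, pos and action attributes, matching the dump output above):

>>> token = tree[0]
>>> token.text, token.pos, token.action
('Some', 1, Text)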
Note that anything you do not look for in your lexicons (in this case most whitespace, for example) is simply ignored. But the special rule with default_action matches everything not captured by another rule.
This tree structure is what parce provides. You can find a token by position:
>>> tree.find_token(27) # finds token at position 27
<Token 'and' at 26:29 (Text)>
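From a token you can also climb back up; the following sketch assumes every node has a parent attribute referring to its enclosing Context:

>>> token = tree.find_token(40)    # position 40 lies inside the string
>>> token.parent
<Context Nonsense.string at 33-67 (2 children)>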
You can also search for text, or for certain actions or lexicons. Both Token and Context have a query property that unleashes these powers:
>>> list(tree.query.all("and"))
[<Token 'and' at 26:29 (Text)>, <Token 'and' at 69:72 (Text)>]
>>> list(tree.query.all.action(a.Comment))
[<Token '%' at 75:76 (Comment)>, <Token ' comment that' at 76:89 (Comment)>]
>>> tree.query.all.action(a.Number).count()
3
>>> tree.query.all(Nonsense.string).dump()
<Context Nonsense.string at 33-67 (2 children)>
├╴<Token 'string insid...ultiple lines' at 33:66 (Literal.String)>
╰╴<Token '"' at 66:67 (Literal.String)>
See the query module for more information.
Note
It is not necessary at all to use the predefined actions of parce in your language definition; you can specify any object you want, including strings or methods.
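For instance, here is a minimal sketch of a language definition that uses plain strings as actions (the Greeting class is purely illustrative, not part of parce):

from parce import Language, lexicon

class Greeting(Language):
    @lexicon
    def root(cls):
        # any object may serve as an action; here we use plain strings
        yield r'\d+', "number"
        yield r'\w+', "word"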
If you want, you can also get a flat stream of events describing the parsing process. Events are simply named tuples consisting of a target and a lexemes tuple, where each lexeme is a (position, text, action) tuple and the target (if not None) records how the lexicon stack changed: how many lexicons were popped and which were pushed. This is what parce internally uses to build the tree structure:
>>> from parce import events
>>> for e in events(Nonsense.root, text):
... print(e)
...
Event(target=None, lexemes=((1, 'Some', Text),))
Event(target=None, lexemes=((6, 'text', Text),))
Event(target=None, lexemes=((11, 'with', Text),))
Event(target=None, lexemes=((16, '3', Literal.Number),))
Event(target=None, lexemes=((18, 'numbers', Text),))
Event(target=None, lexemes=((26, 'and', Text),))
Event(target=None, lexemes=((30, '1', Literal.Number),))
Event(target=None, lexemes=((32, '"', Literal.String),))
Event(target=Target(pop=0, push=(Nonsense.string,)), lexemes=((33, 'string inside\nover multiple lines', Literal.String),))
Event(target=None, lexemes=((66, '"', Literal.String),))
Event(target=Target(pop=-1, push=()), lexemes=((67, ',', Delimiter),))
Event(target=None, lexemes=((69, 'and', Text),))
Event(target=None, lexemes=((73, '1', Literal.Number),))
Event(target=None, lexemes=((75, '%', Comment),))
Event(target=Target(pop=0, push=(Nonsense.comment,)), lexemes=((76, ' comment that', Comment),))
Event(target=Target(pop=-1, push=()), lexemes=((90, 'ends', Text),))
Event(target=None, lexemes=((95, 'on', Text),))
Event(target=None, lexemes=((98, 'a', Text),))
Event(target=None, lexemes=((100, 'newline', Text),))
Event(target=None, lexemes=((107, '.', Delimiter),))
More information about the events stream can be found in the documentation of the lexer module.