The lexicon module
A Lexicon groups rules to match.
A Lexicon is created by decorating a method yielding rules with the @lexicon decorator. (Strictly speaking, the decorator creates a LexiconDescriptor. When a LexiconDescriptor is accessed for the first time via a Language subclass, a Lexicon for that class is created and cached, and that same Lexicon is returned each time the attribute is accessed.)
This makes it possible to inherit from a Language class and re-implement only some lexicons; the others keep working as in the base class.
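The per-class caching described above can be illustrated with a plain Python descriptor. This is a minimal sketch and not parce's actual implementation: `Bundle` and `BundleDescriptor` are hypothetical stand-ins for Lexicon and LexiconDescriptor, and the classes `Base` and `Derived` exist only for this illustration.

```python
class Bundle:
    """Hypothetical stand-in for a Lexicon: bound to one owner class."""

    def __init__(self, descriptor, owner):
        self.descriptor = descriptor
        self.owner = owner

    def rules(self):
        # Run the decorated function with the owner class as argument.
        return tuple(self.descriptor.func(self.owner))


class BundleDescriptor:
    """Hypothetical stand-in for a LexiconDescriptor."""

    def __init__(self, func):
        self.func = func
        self._cache = {}  # maps owner class -> Bundle

    def __get__(self, instance, owner):
        # Create the Bundle on first access via this class, then reuse it.
        try:
            return self._cache[owner]
        except KeyError:
            bundle = self._cache[owner] = Bundle(self, owner)
            return bundle


class Base:
    keyword = "base"

    @BundleDescriptor
    def root(cls):
        yield cls.keyword, "A keyword"


class Derived(Base):
    keyword = "derived"  # root is inherited, not re-implemented


print(Base.root is Base.root)     # the cached object is returned
print(Derived.root is Base.root)  # each class gets its own object
print(Derived.root.rules())       # rules are computed for the subclass
```

Because the descriptor receives the class through which it was accessed, an inherited rules function sees the attributes of the subclass without being redefined.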
The Lexicon can parse text according to the rules. When its parse() method is called for the first time, the rules function is run with the language class as argument, and the rules it yields are cached.
The Lexicon then combines the patterns of the rules into one regular expression that is used to parse the text, applying some optimizations. (For example, when a lexicon has only one pattern rule, which turns out to be an unambiguous string, str.find() is used rather than re.search().)
Example:
>>> from parce import Language, lexicon
>>>
>>> class MyLang(Language):
...     @lexicon
...     def numbers(cls):
...         yield r'\d+', "A number"
...         yield r'\w+', "A word"
...
>>> MyLang.numbers
MyLang.numbers
>>> type(MyLang.numbers)
<class 'parce.lexicon.Lexicon'>
>>> for i in MyLang.numbers.parse("1 a2 d3 4 p 5", 0):
...     print(i)
...
(0, '1', <re.Match object; span=(0, 1), match='1'>, 'A number', None)
(2, 'a2', <re.Match object; span=(2, 4), match='a2'>, 'A word', None)
(5, 'd3', <re.Match object; span=(5, 7), match='d3'>, 'A word', None)
(8, '4', <re.Match object; span=(8, 9), match='4'>, 'A number', None)
(10, 'p', <re.Match object; span=(10, 11), match='p'>, 'A word', None)
(12, '5', <re.Match object; span=(12, 13), match='5'>, 'A number', None)
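How several patterns might be combined into one regular expression, with a str.find() fast path for a single plain string, can be sketched using only the standard library. This is an illustration under assumptions, not parce's own code: `make_parser` is a hypothetical helper, rules are reduced to (pattern, action) pairs, the target is always None, and the plain-string check via re.escape() is a deliberately conservative guess at what "unambiguous" could mean.

```python
import re


def make_parser(rules):
    """Return a parse(text, pos) generator function for (pattern, action) rules."""
    rules = list(rules)

    # Fast path: one rule whose pattern contains no characters that
    # re.escape() would change, so it can be searched as a literal string.
    if len(rules) == 1 and rules[0][0] and re.escape(rules[0][0]) == rules[0][0]:
        needle, action = rules[0]

        def parse(text, pos=0):
            while True:
                pos = text.find(needle, pos)
                if pos == -1:
                    return
                # No match object is available on this path.
                yield pos, needle, None, action, None
                pos += len(needle)

        return parse

    # General path: wrap every pattern in a named group and join with "|".
    combined = re.compile(
        "|".join("(?P<g{}>{})".format(i, pattern)
                 for i, (pattern, _) in enumerate(rules)))
    actions = {"g{}".format(i): action for i, (_, action) in enumerate(rules)}

    def parse(text, pos=0):
        for m in combined.finditer(text, pos):
            # m.lastgroup names the alternative that matched.
            yield m.start(), m.group(), m, actions[m.lastgroup], None

    return parse


parse = make_parser([(r'\d+', "A number"), (r'\w+', "A word")])
for item in parse("1 a2 d3 4 p 5", 0):
    print(item)
```

With the two rules of the example above, this sketch yields the same positions, texts and actions as shown there. Earlier alternatives win when two patterns match at the same position, which is why the number rule is listed first.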
Parsing (better: lexing) is done by a Lexer instance, which switches Lexicon when a target is encountered.
- class Lexicon(descriptor, language, arg=None)
Bases: object
A Lexicon parses text according to rules.
A Lexicon is tied to a particular class, which makes it possible to inherit from a Language class and change only some Lexicons.
- parse(text, pos)
Start parsing text from the specified position. Yields five-tuples (pos, text, matchobj, action, target).
The pos is the start position at which a match was found, text is the matched text, matchobj is the match object (which can be None for default actions), action is the action that was specified in the matching rule, and target is either None or a Target object.
- descriptor
The LexiconDescriptor this Lexicon was created by.
- language
The Language class the lexicon belongs to.
- re_flags
The re_flags that were set on instantiation.
- consume
Whether this lexicon wants the token(s) that switched to it.
- name
The short name (name of the method this Lexicon was defined with).
- fullname
The short name with the Language name prepended, like 'Language.lexicon'.
- qualname
The full name with the Language's module prepended, like 'parce.lang.xml.Xml.root'.
- property arg
The argument the lexicon was called with (creating a derived Lexicon). None for a normal lexicon.
- __call__(arg=None)
Create a derived Lexicon with argument arg.
The argument should be a simple, hashable singleton object, such as a string, an integer or a standard action. The created Lexicon is cached. The argument is accessible using special pattern and rule item types, so a derived Lexicon can parse text based on rules that are defined at parse time. This is useful for things like here documents, where the end token is only known after the start token has been found.
When comparing Lexicons with ==, a derived lexicon compares equal with the Lexicon that created it, although they co-exist as separate objects. Use is to compare on identity.
When yielding the rules from a derived lexicon, the dynamic rule items that depend on the Lexicon argument are already evaluated. When yielding the rules from a vanilla lexicon, they are not evaluated, so they adjust themselves to the lexicon they are included in (which then evaluates them).
If arg is None, self is returned.
- property rules
Return all rules in a tuple.
Rule items that depend on the lexicon argument are already evaluated.
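The caching and comparison behaviour described for __call__() above can be illustrated with a small self-contained class. This is a sketch under assumptions and not parce's implementation: `Lex` is a hypothetical stand-in for Lexicon, and the choice to route calls on a derived object back to the vanilla one is made here only to keep the example short.

```python
class Lex:
    """Hypothetical stand-in for a Lexicon that can create derived variants."""

    def __init__(self, name, arg=None, origin=None):
        self.name = name
        self.arg = arg
        # The vanilla object this one was derived from (itself if vanilla).
        self._origin = origin if origin is not None else self
        self._derived = {}  # maps arg -> derived Lex, used on the vanilla object

    def __call__(self, arg=None):
        if arg is None:
            return self
        vanilla = self._origin
        # Create the derived object once per argument, then reuse it.
        try:
            return vanilla._derived[arg]
        except KeyError:
            derived = vanilla._derived[arg] = Lex(self.name, arg, vanilla)
            return derived

    def __eq__(self, other):
        if isinstance(other, Lex):
            return self._origin is other._origin
        return NotImplemented

    def __hash__(self):
        # Objects that compare equal share an origin, so they hash equally.
        return hash(id(self._origin))


heredoc = Lex("heredoc")
eof = heredoc("EOF")        # derived object for the end token "EOF"

print(eof is heredoc("EOF"))  # same argument returns the cached object
print(eof == heredoc)         # equal to the object that created it
print(eof is heredoc)         # but a separate object
print(heredoc(None) is heredoc)
```

Keeping == loose and is strict lets calling code ask "is this the same lexicon, regardless of argument?" and "is this exactly the same object?" as two separate questions.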