The rule module#
Replaceable rule item objects and helper functions.
Instead of a fixed pattern, action and target you can use dynamic rule items, which are replaced before, during and after lexing the text.
Dynamic rule items can adjust the rule to the lexicon argument, the match object of the regular expression match (if a rule matches) or the matched text.
Rule items that depend on the lexicon argument are already evaluated before the lexicon is first used. Rule items that depend on the text or the match object are evaluated during lexing when a rule matches. Dynamic actions are evaluated after lexing, when generating tokens.
Rule items may not inject arbitrary values in rules; for validation purposes it must always be clear what kind of items a rule could contain before it is used. So in most cases a select() function will be used with a predicate that returns the index of the item to select.
There are also some helper functions that generate output directly, with no special behaviour afterwards.
The following rule items and helper functions are available:
- ARG = ARG#
The lexicon argument.
- MATCH = MATCH#
The regular expression match object.
You can access a specific group using MATCH[n]; groups start with 1. Even MATCH[n][s] is possible, which yields a slice of the matched text in group n.
- TEXT = TEXT#
The matched text.
You can use TEXT[s] to get a slice of the matched text.
- anyof(lexicon, *target)[source]#
Yield certain rules from the specified lexicon, adding a target.
Rules that specify a target themselves, and rules starting with default_action or default_target, are skipped. If no target is specified, the lexicon becomes the target itself (specify 0 to suppress that).
So when you use this function in a lexicon mylexicon like this:

```python
@lexicon
def mylexicon(cls):
    yield from anyof(cls.other_lexicon)

@lexicon
def other_lexicon(cls):
    yield "patt1", Name.Symbol
    yield "patt2", Delimiter, cls.yet_another_lexicon
```
the first rule of other_lexicon is yielded as:

```python
("patt1", Name.Symbol, cls.other_lexicon)
```

but the second rule "patt2" is not yielded, because it has a target itself.
- arg(escape=True, prefix='', suffix='', default=None)[source]#
Create a pattern that contains the argument the current Lexicon was called with.
If there is no argument in the current lexicon, or the argument is not a string, this pattern yields the default value (by default None, resulting in the rule being skipped).
When there is a string argument, it is escaped using re.escape() (when escape was set to True), and if given, prefix is prepended and suffix is appended. When the default value is used, prefix and suffix are not used.
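To see what this amounts to, here is a plain-re sketch of the pattern arg() would build from a string argument; the variable names are illustrative, not parce API:

```python
import re

argument = "EOF"                 # a hypothetical lexicon argument
prefix, suffix = r'\b', r'\b'

# escape the argument and wrap it, as arg(prefix=..., suffix=...) does
pattern = prefix + re.escape(argument) + suffix

print(pattern)                               # \bEOF\b
print(bool(re.search(pattern, "end EOF")))   # True
```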
- bygroup(*actions)[source]#
Return a SubgroupAction that yields tokens for each subgroup in a regular expression.
This action uses capturing subgroups in the regular expression pattern and creates a Token for every subgroup, with that action. You should provide the same number of actions as there are capturing subgroups in the pattern. Use non-capturing subgroups for the parts you’re not interested in, or the special skip action.
An example from the CSS language definition:

```python
yield r"(url)(\()", bygroup(Name, Delimiter), cls.url_function
```
If this rule matches, it generates two tokens, one for “url” and the other for the opening parenthesis, each with their own action.
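The effect can be mimicked with plain re: each capturing group becomes one token with its own action. The action names below are just strings for illustration:

```python
import re

m = re.match(r"(url)(\()", "url(")

# pair every capturing group with its own action, as bygroup() does
tokens = list(zip(m.groups(), ("Name", "Delimiter")))
print(tokens)   # [('url', 'Name'), ('(', 'Delimiter')]
```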
- chars(chars, positive=True)[source]#
Return a regular expression pattern matching one of the characters in the specified string or iterable.
If positive is False, the set of characters is complemented, i.e. the pattern matches any single character that is not in the specified string.
An example:
```python
>>> from parce.rule import chars
>>> chars('zbdkeghjlmfnotpqaruscvx')
'[a-hj-vxz]'
```
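You can check with plain re that the generated character class is equivalent to the unordered input string, assuming the output shown above:

```python
import re

pattern = '[a-hj-vxz]'   # the class chars() produced above

# every character from the input set matches
for c in 'zbdkeghjlmfnotpqaruscvx':
    assert re.fullmatch(pattern, c)
# characters outside the input set are rejected
assert not re.fullmatch(pattern, 'i')
assert not re.fullmatch(pattern, 'w')
```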
- derive(lexicon, argument)[source]#
Yield a derived lexicon with argument.
Example:

```python
yield "['\"]", String, derive(cls.string, TEXT)
```

This enters the lexicon string with a double quote as argument when a double quote is encountered, but with a single quote when a single quote was encountered.
(Deriving a lexicon is not possible with the call statement, because that is not allowed as toplevel rule item.)
- dselect(item, mapping, default=())[source]#
Yield the item from the specified mapping (dictionary).
If the item can’t be found in the mapping, returns default.
An example from the LilyPond music language definition:

```python
RE_LILYPOND_LYRIC_TEXT = r'[^{}"\\\s$#\d]+'
yield RE_LILYPOND_LYRIC_TEXT, dselect(TEXT, {
    "--": LyricHyphen,
    "__": LyricExtender,
    "_": LyricSkip,
}, LyricText)
```
This matches any text blob, but some text items get their own action.
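In spirit, dselect() behaves like a dictionary lookup with a fallback; a minimal sketch, with plain strings standing in for the actions:

```python
mapping = {"--": "LyricHyphen", "__": "LyricExtender", "_": "LyricSkip"}

def pick(text, default="LyricText"):
    # what dselect(TEXT, mapping, LyricText) selects for a given match
    return mapping.get(text, default)

print(pick("--"))   # LyricHyphen
print(pick("la"))   # LyricText
```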
- findmember(item, pairs, default=())[source]#
Yield the item corresponding to the first sequence the item is found in.
The pairs argument is an iterable of tuples (sequence, result). When a sequence contains the item, result is yielded. When no sequence contained the item, default is yielded.
The pairs argument can also be a dictionary, in case the order does not matter.
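The lookup semantics can be sketched in plain Python; this illustrates the documented behaviour, it is not the real rule item, which yields its result during lexing:

```python
def find_member(item, pairs, default=()):
    # the first sequence that contains the item wins
    for sequence, result in pairs:
        if item in sequence:
            return result
    return default

pairs = [("ab", "first"), ("bc", "second")]
print(find_member("b", pairs))            # first
print(find_member("z", pairs, "none"))    # none
```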
- gselect(*results, default=())[source]#
Yield one of the results if that group contributes to the match.
For example:

```python
gselect(arg1, arg2, arg3, arg4, default=default)
```

is equivalent to:

```python
ifgroup(1, arg1,
    ifgroup(2, arg2,
        ifgroup(3, arg3,
            ifgroup(4, arg4, default))))
```

When an arg is None, that group is skipped, so:

```python
gselect(arg1, None, arg2, arg3)
```

is equivalent to:

```python
ifgroup(1, arg1,
    ifgroup(3, arg2,
        ifgroup(4, arg3)))
```
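The underlying idea, shown with plain re: pick the result belonging to the first group that actually contributed to the match. This is a sketch, not the parce implementation:

```python
import re

m = re.match(r"(a+)|(b+)", "bbb")
results = ("letter a", "letter b")

# a group that matched is not None; select its result
chosen = next(r for g, r in zip(m.groups(), results) if g is not None)
print(chosen)   # letter b
```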
- ifarg(pat, else_pat=None)[source]#
Create a pattern that returns the specified regular expression pat if the lexicon was called with an argument.
If there is no argument in the current lexicon, else_pat is yielded, which is None by default, resulting in the rule being skipped.
- ifeq(a, b, result, else_result=())[source]#
Yield result if a == b, else else_result.
This example selects actions and target based on the contents of the second subgroup in the match object:

```python
yield r'([^\W\d]\w*)\s*([\(\]])', \
    ifeq(MATCH[2], '(',
        (bygroup(Name.Function, Delimiter), cls.func_call),
        (bygroup(Name.Variable, Delimiter), cls.subscript))
```
- ifgroup(n, result, else_result=())[source]#
Yield result if match group n is not None.
A regular expression match group is None when the group did not contribute to the match. For example, in the first expression the second group is None, while in the second expression the second group is the empty string:

```python
re.match(r'(a)(b)?', "ac").group(2)   # → None
re.match(r'(a)(b?)', "ac").group(2)   # → ''
```

Shortcut for:

```python
select(call(operator.ne, MATCH[n], None), else_result, result)
```
- ifmember(item, sequence, result, else_result=())[source]#
Yield result if item in sequence, else else_result.
Example:

```python
commands = ['begin', 'end', 'if']
yield r'\\\w+', ifmember(TEXT[1:], commands, Keyword, Name.Variable)
```

This example matches any command that starts with a backslash, e.g. \begin, but checks membership in a list without the backslash prepended.
Membership testing is optimized for speed by turning the sequence into a frozen set.
- pattern(value)[source]#
Yield the value (string or None), usable as regular expression.
If None, the whole rule is skipped. This rule item may only be used as the first item in a rule, and of course, it may not depend on the TEXT or MATCH variables, but it may depend on the ARG variable (which enables you to create patterns that depend on the lexicon argument).
- select(index, *items)[source]#
Yield the item pointed to by the index.
In most use cases the index will be the result of a predicate function, which returns an integer value (or True or False, which evaluate to 1 and 0, respectively).
The following example rule yields tokens for any word, giving it the Keyword action when the matched text can be found in keywords_list, and otherwise Name.Command:

```python
keywords_list = ['def', 'class', 'for', 'if', 'else', 'return']

def is_keyword(text):
    return text in keywords_list

class MyLang(Language):
    @lexicon
    def root(cls):
        yield r'\w+', select(call(is_keyword, TEXT), Name.Command, Keyword)
```
If the selected item is a list or tuple, it is unrolled when injected into the rule.
(For this kind of membership testing, you could also use the ifmember() helper function.)
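Since True and False evaluate to 1 and 0, a boolean predicate indexes the item tuple directly; a plain-Python sketch of just the selection step:

```python
def pick(index, *items):
    # sketch of the selection step: a bool index acts as 0 or 1
    return items[index]

print(pick(False, "Name.Command", "Keyword"))  # Name.Command
print(pick(True, "Name.Command", "Keyword"))   # Keyword
```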
- target(value, *lexicons)[source]#
Yield either an integer target value, or a (possibly derived) Lexicon.
Using this rule item you can have one predicate function decide whether to push the same lexicon again, or to pop, or to target another lexicon, which may also be derived.
This is how it works: when the value is an integer, it is returned. Otherwise the value must be a two-tuple (index, argument). The index selects one of the provided lexicons, and the argument (if not None) is used to call that lexicon, yielding a derived lexicon as the result of this rule item.
Here are some examples:
```python
target(-1)
```

yields -1. And:

```python
target((1, "bla"), MyLang.lexicon1, MyLang.lexicon2)
```

yields MyLang.lexicon2("bla").
The following two incantations are equivalent (where n can be any expression):

```python
target((n, None), MyLang.lexicon1, MyLang.lexicon2)
select(n, MyLang.lexicon1, MyLang.lexicon2)
```

Finally:

```python
target(call(my_predicate, TEXT), MyLang.lexicon1, MyLang.lexicon2)
```

calls my_predicate with the matched text, and then uses the return value to either return it directly or choose a lexicon.
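The documented semantics can be sketched in plain Python; dummy callables stand in for lexicons here, and the real rule item yields its result during lexing rather than returning it:

```python
def pick_target(value, *lexicons):
    # an integer value is returned as-is (push/pop by number)
    if isinstance(value, int):
        return value
    # otherwise the value is a two-tuple (index, argument)
    index, argument = value
    lexicon = lexicons[index]
    return lexicon if argument is None else lexicon(argument)

lexicon1 = lambda arg: ("lexicon1", arg)   # stand-ins for real lexicons
lexicon2 = lambda arg: ("lexicon2", arg)

print(pick_target(-1))                              # -1
print(pick_target((1, "bla"), lexicon1, lexicon2))  # ('lexicon2', 'bla')
```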
- using(lexicon)[source]#
Return a DelegateAction that yields tokens using the specified lexicon.
All tokens are yielded as one group, flattened, ignoring the tree structure, so this is not efficient for large portions of text, as the whole region is parsed again on every modification.
But it can be useful when you want to match a not too large text blob first that’s difficult to capture otherwise, and then lex it with a lexicon that does (almost) not enter other lexicons.
- words(words, prefix='', suffix='')[source]#
Return an optimized regular expression pattern matching any of the words in the specified sequence.
A prefix or suffix can be given, which will be added to the regular expression. Using the word boundary character \b as suffix is recommended to be sure the match ends at a word end.
Here is an example:

```python
>>> from parce.rule import words
>>> CONSTANTS = ('true', 'false', 'null')
>>> words(CONSTANTS, r'\b', r'\b')
'\\b(?:null|(?:fals|tru)e)\\b'
```
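The optimized pattern is equivalent to the plain alternation of the words; a quick check with plain re, using the output shown above:

```python
import re

pattern = r'\b(?:null|(?:fals|tru)e)\b'   # result of words() above

# every word in the sequence matches exactly
for w in ('true', 'false', 'null'):
    assert re.fullmatch(pattern, w)
# the \b suffix prevents matching inside longer words
assert not re.search(pattern, 'nullable')
assert not re.search(pattern, 'construe')
```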