The regex module
Utility module with functions to construct or manipulate regular expressions.
words2regexp(words)
Convert the specified word list to an optimized regular expression.
Example:
>>> import parce.regex
>>> parce.regex.words2regexp(['opa', 'oma', 'mama', 'papa'])
'(?:mam|pap|o[mp])a'
>>> parce.regex.words2regexp(['car', 'cdr', 'caar', 'cadr', 'cdar', 'cddr'])
'c[ad]{1,2}r'
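One way to sanity-check a generated expression is to confirm that it matches every input word in full and rejects similar strings. The sketch below uses only the standard re module and the pattern strings copied from the example above. It does not call parce, and the list of non-matching strings is an arbitrary choice made for illustration.

```python
import re

# Patterns copied from the example output above, paired with their word lists.
cases = {
    "(?:mam|pap|o[mp])a": ["opa", "oma", "mama", "papa"],
    "c[ad]{1,2}r": ["car", "cdr", "caar", "cadr", "cdar", "cddr"],
}

# Arbitrary strings that are not in either word list.
non_words = ["op", "mam", "opaa", "cr", "caaar", "cbr"]

for pattern, words in cases.items():
    rx = re.compile(pattern)
    # Every listed word must match from start to end.
    assert all(rx.fullmatch(w) for w in words)
    # None of the near misses may match in full.
    assert not any(rx.fullmatch(w) for w in non_words)
```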
make_charclass(chars)
Return a string with adjacent characters grouped.
Example:
>>> parce.regex.make_charclass(('a', 'd', 'b', 'f', 'c'))
'a-df'
Supplying a string is also supported:
>>> parce.regex.make_charclass("abcdefghjklmnop")
'a-hj-p'
Special characters are properly escaped.
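The grouping can be pictured with a minimal sketch. The helper name charclass_sketch is hypothetical and this is not parce's implementation. It assumes that only runs of three or more consecutive code points are collapsed into a range, and that the characters to escape inside a class are the backslash, "]", "^" and "-". Both assumptions may differ from what the library actually does.

```python
def charclass_sketch(chars):
    """Group runs of three or more adjacent characters into ranges.

    Hypothetical helper for illustration, not parce's implementation.
    """
    def esc(c):
        # Assumed escaping rule: characters that are special inside [...].
        return "\\" + c if c in "\\]^-" else c

    codes = sorted({ord(c) for c in chars})
    parts = []
    i = 0
    while i < len(codes):
        j = i
        # Extend the run while the code points stay consecutive.
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        if j - i >= 2:
            # Three or more adjacent characters become "first-last".
            parts.append(esc(chr(codes[i])) + "-" + esc(chr(codes[j])))
        else:
            # Shorter runs are written out character by character.
            parts.extend(esc(chr(c)) for c in codes[i:j + 1])
        i = j + 1
    return "".join(parts)


print(charclass_sketch(("a", "d", "b", "f", "c")))  # a-df
print(charclass_sketch("abcdefghjklmnop"))          # a-hj-p
```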
common_suffix(words)
Return (words, suffix), where suffix is the common suffix.
If there is no common suffix, words is returned unchanged and suffix is an empty string. If there is a common suffix, it is removed from each of the returned words. Example:
>>> parce.regex.common_suffix(['opa', 'oma', 'mama', 'papa'])
(['op', 'om', 'mam', 'pap'], 'a')
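The behaviour described above can be approximated as follows. This is a sketch under assumptions, not the library's code: the name common_suffix_sketch is hypothetical, and the treatment of an empty word list is a guess.

```python
def common_suffix_sketch(words):
    """Return (words, suffix) with the longest common suffix removed.

    Hypothetical helper for illustration, not parce's implementation.
    """
    words = list(words)
    if not words:
        # Assumed behaviour for an empty list.
        return words, ""
    shortest = min(words, key=len)
    n = 0
    # Count how many trailing characters all words share.
    while n < len(shortest) and all(w[-1 - n] == shortest[-1 - n] for w in words):
        n += 1
    if n == 0:
        return words, ""
    return [w[:-n] for w in words], shortest[-n:]


print(common_suffix_sketch(["opa", "oma", "mama", "papa"]))
# (['op', 'om', 'mam', 'pap'], 'a')
```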
to_string(expr)
Convert an unambiguous regexp to a plain string.
If the regular expression is unambiguous and can be converted to a plain string, return it. Otherwise, None is returned.
The returned string can be used with str.find(), which would be faster than using re.search(). Examples:
>>> parce.regex.to_string(r"a.e")
>>> parce.regex.to_string(r"a\.e")
'a.e'
>>> parce.regex.to_string(r"a\ne")
'a\ne'
The first call returns None, because the unescaped dot can match many different characters, so the expression does not stand for a single string.
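The point about str.find() can be illustrated with the standard library alone. The sketch below does not call parce: it takes the pattern and the plain string from the example above and checks that both locate the same position. The sample text is made up for illustration.

```python
import re

# Made-up sample text.
text = "find a.e but not abe or ace"

# The regular expression and the plain string that to_string() returns
# for it, both copied from the example above.
pattern = r"a\.e"
plain = "a.e"

# An unambiguous pattern and its plain string find the same position.
match = re.search(pattern, text)
assert match is not None
assert match.start() == text.find(plain)

# The unescaped dot is ambiguous: as a pattern it also matches "abe",
# which the literal string "a.e" does not.
assert re.search(r"a.e", "abe") is not None
assert "abe".find("a.e") == -1
```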
make_trie(words, reverse=False)
Return a dict-based radix trie structure from a list of words.
End-points are denoted by a None key, set to True. If reverse is set to True, the trie is built backwards, starting from the end of the words.
Example:
>>> from parce.regex import make_trie
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> r   # output formatted nicely :-)
{
    "a": {
        "a": {
            "a": {
                "a": { None: True },
                "b": { None: True }
            },
            "bb": { None: True }
        },
        "bbb": {
            None: True,
            "b": { None: True }
        }
    }
}
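A structure of this shape can be produced in two passes: build a plain character trie, then merge chains of single-child nodes into multi-character keys. The following is a minimal sketch of that idea, not parce's implementation. The function names are hypothetical and the reverse option is left out.

```python
def plain_trie(words):
    """Build a one-character-per-level trie; None marks an end-point."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def compress(node):
    """Merge chains of single-child nodes into multi-character keys."""
    result = {}
    for key, child in node.items():
        if key is None:
            result[None] = True
            continue
        # Follow the chain while the child has exactly one branch
        # and is not itself an end-point.
        while len(child) == 1 and None not in child:
            (next_key, next_child), = child.items()
            key += next_key
            child = next_child
        result[key] = compress(child)
    return result


def make_trie_sketch(words):
    """Hypothetical stand-in for make_trie() without the reverse option."""
    return compress(plain_trie(words))


trie = make_trie_sketch(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
print(trie)
```

For the word list above, this yields the same nested dictionary as the example output.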
trie_to_regexp_tuple(node, reverse=False)
Converts the trie node to a tuple of regular expression parts.
A part is either a plain string expression or a frozenset instance. A frozenset instance denotes a group of alternative expressions, and consists of plain string expressions or other tuples. If None is also present in the frozenset, the expression is optional.
Example:
>>> from parce.regex import *
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> trie_to_regexp_tuple(r)
(
    'a',
    frozenset({
        (
            'a',
            frozenset({
                'bb',
                (
                    'a',
                    frozenset({ 'a', 'b' })
                )
            })
        ),
        (
            'bbb',
            frozenset({ None, 'b' })
        )
    })
)
This function also recognizes common suffixes within alternative expressions:
>>> r = make_trie("aaaa aaba aaca abca".split())
>>> r
{'a': {'a': {'aa': {None: True}, 'ba': {None: True}, 'ca': {None: True}}, 'bca': {None: True}}}
>>> t = trie_to_regexp_tuple(r)
>>> t
('a', frozenset({'bca', ('a', frozenset({'c', 'a', 'b'}), 'a')}))
(Note that the top-level common suffix is handled by the common_suffix() function, which is called from words2regexp().)
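To make the tuple structure concrete, the sketch below expands such a tuple back into the set of words it stands for: a tuple is a sequence of parts, a frozenset is a choice between alternatives, and None is the empty alternative. The function expand is hypothetical and not part of parce. It assumes every string part is a plain literal, which holds for the examples above but not for words containing escaped special characters.

```python
def expand(part):
    """Return the set of strings a regexp tuple part stands for.

    Hypothetical helper; assumes string parts are plain literals.
    """
    if part is None:
        # The optional case: the alternative may be absent.
        return {""}
    if isinstance(part, str):
        return {part}
    if isinstance(part, frozenset):
        # Alternatives: the union of what each one expands to.
        result = set()
        for alternative in part:
            result |= expand(alternative)
        return result
    # A tuple: concatenate the expansions of its parts in order.
    result = {""}
    for sub in part:
        result = {head + tail for head in result for tail in expand(sub)}
    return result


# Tuples copied from the two examples above.
t1 = ('a', frozenset({
    ('a', frozenset({'bb', ('a', frozenset({'a', 'b'}))})),
    ('bbb', frozenset({None, 'b'})),
}))
t2 = ('a', frozenset({'bca', ('a', frozenset({'c', 'a', 'b'}), 'a')}))

print(sorted(expand(t1)))  # ['aaaa', 'aaab', 'aabb', 'abbb', 'abbbb']
print(sorted(expand(t2)))  # ['aaaa', 'aaba', 'aaca', 'abca']
```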
build_regexp(r)
Convert a tuple to a full regular expression pattern string.
The tuple is described in the trie_to_regexp_tuple() function docstring.
Example:
>>> from parce.regex import *
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> t = trie_to_regexp_tuple(r)
>>> build_regexp(t)
'a(?:a(?:bb|a[ab])|bbbb?)'
The main function words2regexp() uses this function internally, adding an extra optimization to look for a common suffix.
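A much simplified version of this conversion can be sketched as follows. It is not parce's implementation and the name build_sketch is hypothetical. Tuples are concatenated, frozensets become non-capturing groups, and a None member makes the group optional. Unlike the real function, this sketch does not form character classes or drop redundant groups, so its output is longer than the pattern shown above while matching the same words.

```python
import re


def build_sketch(part):
    """Turn a regexp tuple into a pattern string, without optimizations.

    Hypothetical helper for illustration, not parce's implementation.
    """
    if isinstance(part, str):
        return part
    if isinstance(part, tuple):
        # A tuple is a sequence of parts.
        return "".join(build_sketch(sub) for sub in part)
    # A frozenset is a group of alternatives; None makes it optional.
    optional = None in part
    # Sorting keeps the output stable across runs.
    alternatives = sorted(build_sketch(sub) for sub in part if sub is not None)
    group = "(?:" + "|".join(alternatives) + ")"
    return group + "?" if optional else group


# Tuple copied from the trie_to_regexp_tuple() example.
t = ('a', frozenset({
    ('a', frozenset({'bb', ('a', frozenset({'a', 'b'}))})),
    ('bbb', frozenset({None, 'b'})),
}))

pattern = build_sketch(t)
words = ["aaaa", "aaab", "aabb", "abbb", "abbbb"]
# The unoptimized pattern still matches exactly the original words.
assert all(re.fullmatch(pattern, w) for w in words)
```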