The regex module

Utility module with functions to construct or manipulate regular expressions.

words2regexp(words)

Convert the specified word list to an optimized regular expression.

Example:

>>> import parce.regex
>>> parce.regex.words2regexp(['opa', 'oma', 'mama', 'papa'])
'(?:mam|pap|o[mp])a'
>>> parce.regex.words2regexp(['car', 'cdr', 'caar', 'cadr', 'cdar', 'cddr'])
'c[ad]{1,2}r'
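
As a quick sanity check, the patterns shown above can be tested against their word lists using only the standard re module. This sketch does not call parce; the patterns are copied from the example output, and the helper name accepts_all is hypothetical.

```python
import re

# Patterns and word lists copied from the documented examples above.
cases = {
    "(?:mam|pap|o[mp])a": ["opa", "oma", "mama", "papa"],
    "c[ad]{1,2}r": ["car", "cdr", "caar", "cadr", "cdar", "cddr"],
}

def accepts_all(pattern, words):
    """Return True if the pattern fully matches every word in the list."""
    rx = re.compile(pattern)
    return all(rx.fullmatch(word) is not None for word in words)

results = {pattern: accepts_all(pattern, words) for pattern, words in cases.items()}

# Strings outside the word list should be rejected.
rejected = [re.fullmatch("c[ad]{1,2}r", s) is None for s in ("cr", "caaar", "cbr")]
```

Using fullmatch() rather than search() ensures the whole word is tested, not just a substring.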

make_charclass(chars)

Return a string in which runs of adjacent characters are collapsed into ranges, for use inside a regular expression character class.

Example:

>>> parce.regex.make_charclass(('a', 'd', 'b', 'f', 'c'))
'a-df'

Supplying a string is also supported:

>>> parce.regex.make_charclass("abcdefghjklmnop")
'a-hj-p'

Special characters are properly escaped.
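
The grouping idea can be sketched in a few lines. This is an illustration only, not parce's implementation: the name group_ranges is hypothetical, escaping of special characters is left out, and collapsing only runs of three or more characters is an assumption.

```python
def group_ranges(chars):
    """Collapse runs of consecutive characters into 'x-y' ranges.

    Hypothetical helper; does not escape special characters.
    """
    codes = sorted(set(map(ord, chars)))
    parts = []
    i = 0
    while i < len(codes):
        # Extend j to the end of the run of consecutive code points.
        j = i
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        if j - i >= 2:
            # Three or more consecutive characters: write as a range.
            parts.append(chr(codes[i]) + "-" + chr(codes[j]))
        else:
            # One or two characters: write them out literally.
            parts.extend(chr(c) for c in codes[i:j + 1])
        i = j + 1
    return "".join(parts)
```

With the inputs from the examples above, this sketch yields the same strings, 'a-df' and 'a-hj-p'.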

common_suffix(words)

Return (words, suffix), where suffix is the common suffix.

If there is no common suffix, words is returned unchanged and suffix is an empty string. Otherwise, the suffix is stripped from each of the returned words. Example:

>>> parce.regex.common_suffix(['opa', 'oma', 'mama', 'papa'])
(['op', 'om', 'mam', 'pap'], 'a')
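
One possible way to get this behaviour, shown here as a sketch under assumptions rather than parce's actual code, is to reverse every word and take the common prefix. The name split_common_suffix is hypothetical; os.path.commonprefix is a standard library function that compares strings character by character.

```python
import os.path

def split_common_suffix(words):
    """Return (words, suffix) with the common suffix stripped (hypothetical helper)."""
    # The common suffix is the common prefix of the reversed words.
    reversed_words = [word[::-1] for word in words]
    suffix = os.path.commonprefix(reversed_words)[::-1]
    if not suffix:
        # Nothing in common: hand back the input unchanged.
        return words, ""
    return [word[:-len(suffix)] for word in words], suffix
```

For the word list in the example above, this returns (['op', 'om', 'mam', 'pap'], 'a').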

to_string(expr)

Convert an unambiguous regexp to a plain string.

If the regular expression is unambiguous and can be converted to a plain string, return it. Otherwise, None is returned.

The returned string can be used with str.find(), which is typically faster than re.search(). Examples:

>>> parce.regex.to_string(r"a.e")
>>> parce.regex.to_string(r"a\.e")
'a.e'
>>> parce.regex.to_string(r"a\ne")
'a\ne'

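The first call returns None (so nothing is printed), because the unescaped dot matches more than one possible character and the expression is therefore not a plain string.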
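
To illustrate why the conversion is useful, the following sketch compares re.search() with str.find() for the escaped pattern from the example. It does not call parce; the plain string is copied from the example output, and the sample text is made up for the illustration.

```python
import re

text = "see a.e here"   # made-up sample text
pattern = r"a\.e"       # unambiguous: matches only the literal string "a.e"
plain = "a.e"           # the value the example above shows for this pattern

# Both approaches locate the same position in the text.
regex_pos = re.search(pattern, text).start()
find_pos = text.find(plain)
```

Both searches report the same offset, so a caller holding the plain string can skip the regular expression engine entirely.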

make_trie(words, reverse=False)

Return a dict-based radix trie structure from a list of words.

End points are denoted by a None key set to True. If reverse is True, the trie is built backwards, starting from the end of the words.

Example:

>>> from parce.regex import make_trie
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> r   # output formatted nicely :-)
{
    "a": {
        "a": {
            "a": {
                "a": {
                    None: True
                },
                "b": {
                    None: True
                }
            },
            "bb": {
                None: True
            }
        },
        "bbb": {
            None: True,
            "b": {
                None: True
            }
        }
    }
}
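
A structure of this shape can be produced in two steps: build a plain per-character trie, then merge chains of single children into multi-character keys. The sketch below is an assumption about one way to do it, not parce's implementation; the names build_char_trie and compress are hypothetical, and the reverse option is not covered.

```python
def build_char_trie(words):
    """Build a dict-based trie with one character per key."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True  # mark the end of a word
    return root

def compress(node):
    """Merge chains of single, non-terminal children into longer keys."""
    result = {}
    for key, child in node.items():
        if key is None:
            result[None] = True
            continue
        # Follow the chain while there is exactly one way to continue
        # and no word ends at this point.
        while len(child) == 1 and None not in child:
            (next_key, next_child), = child.items()
            key += next_key
            child = next_child
        result[key] = compress(child)
    return result

trie = compress(build_char_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"]))
```

For this word list, the resulting dict has the same keys and nesting as the output shown above.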

trie_to_regexp_tuple(node, reverse=False)

Convert a trie node to a tuple of regular expression parts.

A part is either a plain string expression or a frozenset instance. A frozenset instance denotes a group of alternative expressions, and consists of plain string expressions or other tuples. If None is also present in the frozenset, the expression is optional.

Example:

>>> from parce.regex import *
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> trie_to_regexp_tuple(r)
(
    'a',
    frozenset({
        (
            'a',
            frozenset({
                'bb',
                (
                    'a',
                    frozenset({
                        'a',
                        'b'
                    })
                )
            })
        ),
        (
            'bbb',
            frozenset({
                None,
                'b'
            })
        )
    })
)

This function also recognizes common suffixes within alternative expressions:

>>> r = make_trie("aaaa aaba aaca abca".split())
>>> r
{'a':
    {'a':
        {'aa': {None: True},
         'ba': {None: True},
         'ca': {None: True}},
     'bca': {None: True}}}
>>> t = trie_to_regexp_tuple(r)
>>> t
('a',
 frozenset({'bca',
            ('a',
             frozenset({'c',
                        'a',
                        'b'}),
             'a')}))

(Note that the top-level common suffix is handled by the common_suffix() function, which is called from words2regexp().)

build_regexp(r)

Convert a tuple to a full regular expression pattern string.

The tuple is described in the docstring of the trie_to_regexp_tuple() function.

Example:

>>> from parce.regex import *
>>> r = make_trie(["aaaa", "aaab", "aabb", "abbb", "abbbb"])
>>> t = trie_to_regexp_tuple(r)
>>> build_regexp(t)
'a(?:a(?:bb|a[ab])|bbbb?)'

The main function words2regexp() uses this function internally, adding an extra optimization to look for a common suffix.
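
To make the tuple format more concrete, here is a naive renderer that walks the structure described for trie_to_regexp_tuple(). It is a sketch, not parce's build_regexp(): the name render_part is hypothetical, and it omits the optimizations visible in the real output (character classes such as [ab] and a bare ? after a single character), so the pattern it produces is longer but should match the same words.

```python
import re

def render_part(part):
    """Render a str, tuple or frozenset part as a regex fragment (hypothetical helper)."""
    if isinstance(part, str):
        # Plain string: match it literally.
        return re.escape(part)
    if isinstance(part, tuple):
        # Tuple: a sequence of parts, concatenated.
        return "".join(render_part(p) for p in part)
    # frozenset: a group of alternatives; None makes the group optional.
    alternatives = sorted(render_part(p) for p in part if p is not None)
    group = "(?:" + "|".join(alternatives) + ")"
    return group + "?" if None in part else group

# The tuple from the trie_to_regexp_tuple() example.
t = ('a', frozenset({
    ('a', frozenset({'bb', ('a', frozenset({'a', 'b'}))})),
    ('bbb', frozenset({None, 'b'})),
}))
pattern = render_part(t)
words = ["aaaa", "aaab", "aabb", "abbb", "abbbb"]
all_match = all(re.fullmatch(pattern, w) for w in words)
```

Sorting the alternatives only makes the output deterministic, since frozensets have no defined order.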