Source code for parce.query

# -*- coding: utf-8 -*-
#
# This file is part of the parce Python package.
#
# Copyright © 2019-2020 by Wilbert Berendsen <info@wilbertberendsen.nl>
#
# This module is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This module is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.


r"""
Query the tree using the `query` property of Context and Token.

Normally you need not import this module in order to use it. Using the
:attr:`~.tree.Node.query` property of any Token or (in most cases) Context, you
can query the token tree to find tokens and contexts, based on lexicons and/or
actions and text contents. You can chain calls in an XPath-like fashion.

This module supplements the various ``find_xxx`` methods of every Context
object. A query is a generator: it starts at the ``query`` property of a
Context or Token object and initially yields just that object.

You can navigate using `children`, `all`, `first`, `last`, `[n]`, `[n:n]`,
`[n:n:n]`, `next`, `previous`, `right`, `left`, `right_siblings`,
`left_siblings`, `map()`, `parent` and `ancestors`. Use `uniq` to remove
duplicate occurrences of nodes, which can happen e.g. when navigating to the
parent of all nodes.

You can narrow down the search using `tokens`, `contexts`, `remove_ancestors`,
`remove_descendants`, `slice()` and `filter()`.

You can search for tokens using `('text')` or `(lexicon)`, `startingwith()`,
`endingwith()`, `containing()`, `matching()`, `action()` or `in_action()`. The
special prefix `is_not` inverts the query, so `query.is_not.containing("bla")`
yields Tokens that do not contain the text "bla".


Examples:

Find all tokens that are the first child of a Context with the ``bla`` lexicon::

    root.query.all(MyLang.bla)[0]


Find (in Xml) all attributes with name 'name' that are in a <bla> tag::

    root.query.all.action(Name.Tag)("bla").next('name')


Find all tags containing "hi" in their text nodes::

    root.query.all.action(Name.Tag).next.next.action(Text).containing('hi')


Find all comments that have TODO in them::

    root.query.all.action(Comment).containing('TODO')


Find all "\\version" tokens in the root context that have a "2" in the version
string after them::

    (t for t in root.query.children('\\version')
        if t.query.next.next.containing('2'))

Which could also be written as::

    root.query.children('\\version').filter(
        lambda t: t.query.next.next.containing('2'))


A query is a generator, so you can iterate over the results::

    for attrs in q.all.action(Name.Tag)('origin').right:
        for atr in attrs.query.action(Name.Attribute):
            print(atr)


For debugging purposes, there are also the ``list()`` construct and the
``pick()``, ``count()`` and ``dump()`` methods::

    root.query.all.action(Name.Tag)("img").count() # number of "img" tags
    list(root.query.all.action(Name.Tag)("img"))   # list of all "img" tag name tokens


Note that a (partial) query can be reused; it simply restarts the iteration
over the results. The above could also be written as::

    q = root.query.all.action(Name.Tag)("img")
    q.count()   # number of "img" tags
    list(q)     # list of all "img" tag name tokens


A query evaluates to False if there are no results::

    if token.query.ancestors(LilyPond.header):
        do_something() # the token is a descendant of a LilyPond.header context


You can also directly instantiate a Query object for a list of nodes, if you
want to query those in one go::

    q = Query.from_nodes(nodes)



Summary of the query methods:
-----------------------------

Endpoint methods (some are mainly for debugging):

:attr:`~Query.ls`,
:meth:`~Query.count`,
:meth:`~Query.dump`,
:meth:`~Query.pick`,
:meth:`~Query.pick_last`,
:meth:`~Query.range` and
:meth:`~Query.delete`.


Navigating nodes:
^^^^^^^^^^^^^^^^^

The query itself just yields the node it was started from. But using the
following methods you can find your way through a tree structure. Every method
returns a new Query object, having the previous one as source of nodes. Most
methods are implemented as properties, so you don't have to write parentheses.

:attr:`~Query.all`,
:attr:`~Query.children`,
:attr:`~Query.parent`,
:attr:`~Query.ancestors`,
:attr:`~Query.next`,
:attr:`~Query.previous`,
:attr:`~Query.forward`,
:attr:`~Query.backward`,
:attr:`~Query.right`,
:attr:`~Query.left`,
:attr:`~Query.right_siblings`,
:attr:`~Query.left_siblings`,
:attr:`[n] <Query.__getitem__>`,
:attr:`[n:m] <Query.__getitem__>`,
:attr:`~Query.first`,
:attr:`~Query.last`, and
:meth:`~Query.map`.


Selecting (filtering) nodes:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

These methods filter out current nodes without adding new nodes
to the selection:

:attr:`~Query.tokens`,
:attr:`~Query.contexts`,
:attr:`~Query.uniq`,
:attr:`~Query.remove_ancestors`,
:attr:`~Query.remove_descendants`,
:meth:`~Query.slice` and
:meth:`~Query.filter`.

The special :attr:`~Query.is_not` operator inverts the meaning of the
next query, e.g.::

    n.query.all.is_not.startingwith("text")

The following query methods can be inverted by prepending `is_not`:

:meth:`~Query.len`,
:meth:`~Query.in_range`,
:meth:`(lexicon) <Query.__call__>`,
:meth:`(lexicon, lexicon2, ...) <Query.__call__>`,
:meth:`("text") <Query.__call__>`,
:meth:`("text", "text2", ...) <Query.__call__>`,
:meth:`~Query.startingwith`,
:meth:`~Query.endingwith`,
:meth:`~Query.containing`,
:meth:`~Query.matching`,
:meth:`~Query.action` and
:meth:`~Query.in_action`.

There is a subtle difference between `action` and `in_action`: with the
former, the action must match exactly; with the latter, tokens are selected
when the action matches exactly or is a descendant of one of the given
actions.

"""


import collections
import functools
import itertools
import re
import sys

from .lexicon import Lexicon


def query(func):
    """Make a method result (generator) into a new Query object."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        return Query(lambda: func(self, *args, **kwargs))
    return wrapper


def pquery(func):
    """Make a method result into a Query object, and the method a property."""
    return property(query(func))
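The deferred-call pattern these decorators implement can be sketched in a
standalone way. The toy ``Numbers`` class below is hypothetical (it is not part
of parce, and the real decorators wrap results in ``Query`` rather than
``type(self)``); it only shows why wrapping a generator method in a lambda
makes the chained result re-iterable:

```python
import functools

def query(func):
    """Wrap a generator method so it returns a new chainable object."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        # Defer the generator call: iterating the result calls func()
        # again, so the chain can be restarted any number of times.
        return type(self)(lambda: func(self, *args, **kwargs))
    return wrapper

def pquery(func):
    """Like query(), but exposed as a property (no parentheses needed)."""
    return property(query(func))

class Numbers:
    """Hypothetical Query-like chain over plain integers."""
    def __init__(self, gen):
        self._gen = gen

    def __iter__(self):
        return self._gen()

    @pquery
    def doubled(self):
        for n in self:
            yield n * 2

    @query
    def above(self, limit):
        for n in self:
            if n > limit:
                yield n

q = Numbers(lambda: iter([1, 2, 3])).doubled.above(3)
print(list(q))  # [4, 6]
print(list(q))  # restartable: [4, 6] again
```

The key design point mirrored here is that each step stores a *generator
factory*, not a generator, which is why a (partial) query can be reused.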


class Query:
    """A Query navigates and filters a node tree.

    A Query is instantiated either by calling :attr:`Token.query
    <parce.tree.Node.query>` or :attr:`Context.query <parce.tree.Node.query>`,
    or by calling :meth:`Query.from_nodes` on a list of nodes (tokens and/or
    contexts).

    """
    __slots__ = '_gen', '_inv'

    def __init__(self, gen, invert=False):
        self._gen = gen
        self._inv = invert

    def __iter__(self):
        return self._gen()

    @classmethod
    def from_nodes(cls, nodes):
        """Create a Query object querying a list of nodes in one go."""
        return cls(lambda: iter(nodes))

    # end points
    def __bool__(self):
        """Return True if there is at least one result."""
        for n in self:
            return True
        return False

    @property
    def ls(self):
        """List current selection of this Query, for debugging purposes."""
        for i, n in enumerate(self):
            print("[{}] {}".format(i, repr(n)))

    def count(self):
        """Compute the length of the iterable."""
        return sum(1 for _ in self)

    def dump(self, file=None, style=None):
        """Dump all selected nodes to the console (or to file).

        .. seealso:: :meth:`.tree.Node.dump`

        """
        for n in self:
            n.dump(file, style)

    def pick(self, default=None):
        """Pick the first value, or return the default."""
        for n in self:
            return n
        return default

    def pick_last(self, default=None):
        """Pick the last value, or return the default."""
        for default in self:
            pass
        return default

    def range(self):
        """Return the text range as a tuple (pos, end).

        The ``pos`` is the lowest pos of the nodes in the current set, and
        ``end`` is the highest end of the nodes. If the result set is empty,
        (-1, -1) is returned.

        """
        pos = end = -1
        nodes = iter(self)
        for n in nodes:
            pos = n.pos
            end = n.end
            for n in nodes:
                pos = min(pos, n.pos)
                end = max(end, n.end)
        return pos, end

    def delete(self):
        """Delete all selected nodes from their parents.

        Internally calls ``uniq`` and ``remove_descendants``, so that no
        unnecessary deletes are done. If a context would become empty, that
        context itself is deleted instead of all its children (except for
        the root of course). Returns the number of nodes that were deleted.

        .. note::

           If you delete tokens from a tree which belong to a group, the
           tree cannot reliably be used by a treebuilder for a partial
           rebuild.

        """
        d = collections.defaultdict(list)
        for n in self.uniq.remove_descendants:
            d[n.parent].append(n)
        count = 0
        # deleting a root context makes no sense, clear it in that case
        roots = d.get(None)
        if roots:
            for root in roots:
                count += len(root)
                root.clear()
            del d[None]
        # if a parent loses all children, remove itself too
        while True:
            remove = [parent for parent, nodes in d.items()
                      if len(nodes) == len(parent) and parent.parent is not None]
            if not remove:
                break
            for n in remove:
                del d[n]
                d[n.parent].append(n)
        # be sure nodes to delete are sorted on pos
        for l in d.values():
            l.sort(key=lambda n: n.pos)
            count += len(l)
        for parent, nodes in d.items():
            n = nodes[0]
            i = 0 if n is parent[0] else n.parent_index()
            slices = [[i, i+1]]
            for n in nodes[1:]:
                if n is parent[i+1]:
                    i += 1
                    slices[-1][1] += 1
                else:
                    i = n.parent_index()
                    slices.append([i, i+1])
            for i, j in reversed(slices):
                del parent[i:j]
        return count

    # navigators
    @query
    def __getitem__(self, key):
        """Get the specified item or items of every context node.

        Note that the result nodes always form a flat iterable. No IndexError
        will be raised if an index would be out of range for any node.

        """
        # slicing or itemgetting with integers are not invertible selectors
        if isinstance(key, slice):
            for n in self:
                if n.is_context:
                    yield from n[key]
        else:
            for n in self:
                if n.is_context:
                    k = key + len(n) if key < 0 else key
                    if 0 <= k < len(n):
                        yield n[k]

    @pquery
    def children(self):
        """All direct children of the current nodes."""
        for n in self:
            if n.is_context:
                yield from n

    @pquery
    def all(self):
        """All descendants, contexts and their nodes."""
        def innergen(n):
            stack = []
            j = 0
            while True:
                for i in range(j, len(n)):
                    m = n[i]
                    yield m
                    if m.is_context:
                        stack.append(i)
                        j = 0
                        n = m
                        break
                else:
                    if stack:
                        n = n.parent
                        j = stack.pop() + 1
                    else:
                        break
        for n in self:
            yield n
            if n.is_context:
                yield from innergen(n)

    @pquery
    def alltokens(self):
        """Shortcut for all.tokens."""
        for n in self:
            if n.is_token:
                yield n
            else:
                yield from n.tokens()

    @pquery
    def allcontexts(self):
        """Shortcut for all.contexts."""
        def innergen(n):
            stack = []
            j = 0
            while True:
                for i in range(j, len(n)):
                    m = n[i]
                    if m.is_context:
                        yield m
                        stack.append(i)
                        j = 0
                        n = m
                        break
                else:
                    if stack:
                        n = n.parent
                        j = stack.pop() + 1
                    else:
                        break
        for n in self:
            if n.is_context:
                yield n
                yield from innergen(n)

    @pquery
    def parent(self):
        """Yield the parent of every node.

        This can lead to many double occurrences of the same node in the
        result set; use :attr:`~Query.uniq` to fix that.

        """
        for n in self:
            if n.parent:
                yield n.parent

    @pquery
    def ancestors(self):
        """Yield the ancestor contexts of every node."""
        for n in self:
            yield from n.ancestors()

    @pquery
    def first(self):
        """Yield the first node of every context node, same as [0]."""
        for n in self:
            if n and n.is_context:
                yield n[0]

    @pquery
    def last(self):
        """Yield the last node of every context node, same as [-1]."""
        for n in self:
            if n and n.is_context:
                yield n[-1]

    @pquery
    def next(self):
        """Yield the next token, if any."""
        for n in self:
            t = n.next_token()
            if t:
                yield t

    @pquery
    def previous(self):
        """Yield the previous token, if any."""
        for n in self:
            t = n.previous_token()
            if t:
                yield t

    @pquery
    def forward(self):
        """Yield Tokens in forward direction."""
        for n in self:
            yield from n.forward()

    @pquery
    def backward(self):
        """Yield Tokens in backward direction."""
        for n in self:
            yield from n.backward()

    @pquery
    def right(self):
        """Yield the right sibling, if any."""
        for n in self:
            n = n.right_sibling()
            if n:
                yield n

    @pquery
    def left(self):
        """Yield the left sibling, if any."""
        for n in self:
            n = n.left_sibling()
            if n:
                yield n

    @pquery
    def right_siblings(self):
        """Yield the right siblings, if any."""
        for n in self:
            yield from n.right_siblings()

    @pquery
    def left_siblings(self):
        """Yield the left siblings, if any."""
        for n in self:
            yield from n.left_siblings()

    @query
    def map(self, function):
        """Call the function on every node and yield its results, which
        should be zero or more nodes as well."""
        for n in self:
            yield from function(n)

    # selectors
    @query
    def filter(self, predicate):
        """Yield nodes for which the predicate returns a value that
        evaluates to True."""
        for n in self:
            if predicate(n):
                yield n

    @pquery
    def tokens(self):
        """Get only the tokens."""
        for n in self:
            if n.is_token:
                yield n

    @pquery
    def contexts(self):
        """Get only the contexts."""
        for n in self:
            if n.is_context:
                yield n

    @pquery
    def uniq(self):
        """Remove double occurrences of the same node from the result set.

        This can happen e.g. when you find the parent of multiple nodes.

        """
        seen = set()
        for n in self:
            i = id(n)
            if i not in seen:
                seen.add(i)
                yield n

    @query
    def slice(self, *args):
        """Slice the full result set, using :py:func:`itertools.islice`.

        This can help narrowing down the result set. For example::

            root.query.all("blaat").slice(1).right_siblings.slice(3)

        ... will continue the query with only the first occurrence of a token
        "blaat", and then look for at most three right siblings. If the
        slice(1) were not there, all the right siblings would become one large
        result set because you wouldn't know how many tokens "blaat" were
        matched.

        """
        yield from itertools.islice(self, *args)

    @pquery
    def remove_descendants(self):
        """Remove nodes that have ancestors in the current node list."""
        ids = set(map(id, self))
        for n in self:
            if not any(id(p) in ids for p in n.ancestors()):
                yield n

    @pquery
    def remove_ancestors(self):
        """Remove nodes that have descendants in the current node list."""
        ids = set(map(id, self)) & set(map(id,
            (p for n in self for p in n.ancestors())))
        for n in self:
            if id(n) not in ids:
                yield n

    @property
    def is_not(self):
        """Invert the next query."""
        return type(self)(self._gen, not self._inv)

    # invertible selectors
    @query
    def len(self, min_length, max_length=None):
        """Only yield contexts, with min_length, or with length between
        min and max."""
        if max_length is None:
            for n in self:
                if n.is_context and self._inv ^ (len(n) == min_length):
                    yield n
        else:
            for n in self:
                if n.is_context and self._inv ^ (min_length <= len(n) <= max_length):
                    yield n

    @query
    def in_range(self, start=0, end=None):
        """Yield a restricted set; tokens and/or contexts must fall in start→end."""
        if end is None:
            end = sys.maxsize
        # don't assume the tokens are in source order
        if self._inv:
            for n in self:
                if n.end <= start or n.pos >= end:
                    yield n
        else:
            for n in self:
                if n.pos >= start and n.end <= end:
                    yield n

    @query
    def __call__(self, *what):
        """Yield token if token has that text, or context if context has
        that lexicon.

        You can even mix the types if you'd need to::

            for n in tree.query.all("%", Lang.comment):
                # do something

        yields tokens that are a percent sign and contexts that have the
        Lang.comment lexicon.

        """
        for n in self:
            if self._inv ^ ((n.text if n.is_token else n.lexicon) in what):
                yield n

    @query
    def startingwith(self, text):
        """Yield tokens that start with text."""
        for t in self:
            if t.is_token and self._inv ^ t.text.startswith(text):
                yield t

    @query
    def endingwith(self, text):
        """Yield tokens that end with text."""
        for t in self:
            if t.is_token and self._inv ^ t.text.endswith(text):
                yield t

    @query
    def containing(self, text):
        """Yield tokens that contain the specified text."""
        for t in self:
            if t.is_token and self._inv ^ (text in t.text):
                yield t

    @query
    def matching(self, pattern, flags=0):
        """Yield tokens matching the regular expression.

        :func:`re.search` is used, so the expression can match anywhere
        unless you use ^ or $ characters.

        """
        search = re.compile(pattern, flags).search
        for t in self:
            if t.is_token and self._inv ^ bool(search(t.text)):
                yield t

    @query
    def action(self, *actions):
        """Yield those tokens whose action *is* one of the given actions."""
        for t in self:
            if t.is_token and self._inv ^ (t.action in actions):
                yield t

    @query
    def in_action(self, *actions):
        """Yield those tokens whose action *is or inherits from* one of the
        given actions."""
        for t in self:
            if t.is_token and self._inv ^ any(t.action in a for a in actions):
                yield t
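The ``action``/``in_action`` distinction boils down to the two membership
tests above: ``t.action in actions`` checks the tuple of arguments for an
exact match, while ``t.action in a`` asks each action whether the token's
action descends from it. This can be illustrated with a self-contained
sketch; the ``Action`` class below is a hypothetical stand-in for parce's
standard actions, which support a similar ``in`` containment test:

```python
class Action:
    """Hypothetical stand-in for a parce standard action (not the real class)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def __contains__(self, other):
        # An action is "in" another if it is that action or a descendant of it.
        while isinstance(other, Action):
            if other is self:
                return True
            other = other.parent
        return False

Name = Action("Name")
Tag = Action("Name.Tag", parent=Name)

actions = (Name,)       # the arguments given to action()/in_action()
token_action = Tag      # the action of some token

# action(Name): exact membership in the argument tuple
exact = token_action in actions                      # False: Tag is not Name
# in_action(Name): per-action containment test
inherits = any(token_action in a for a in actions)   # True: Tag descends from Name
```

So ``action(Name)`` would skip a token carrying ``Name.Tag``, while
``in_action(Name)`` would select it.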