The docio module

I/O handling for Documents.

This module defines DocumentIOMixin to mix in with the other parce Document base classes, adding load and save methods. These methods are optional; you can implement your own save and load logic instead.

When a Document is loaded or saved, the filename is stored in the Document’s url attribute, and if you specify an encoding, it is also stored in the Document’s encoding attribute.

In addition, this module supports language-aware encoding detection: the language of a Document can point to an IO subclass that determines the encoding based on that language. Such an IO “sister-class” of a Language can define a default encoding, and it provides a method that inspects the document’s contents for an encoding declaration, which is then used for I/O operations.

class DecodeResult(root_lexicon, text, encoding)

Bases: tuple

The result of the DocumentIOMixin.decode_data() method.

encoding

The encoding that was specified or determined, or None.

root_lexicon

The root lexicon or None.

text

The decoded text.
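As a rough illustration of how a named tuple with these three fields behaves, the sketch below builds a stand-in with collections.namedtuple. It is not imported from parce; the field order follows the signature shown above, and the sample values are made up.

```python
from collections import namedtuple

# Stand-in for illustration only; the field order follows the documented signature.
DecodeResult = namedtuple("DecodeResult", "root_lexicon text encoding")

result = DecodeResult(root_lexicon=None, text="hello", encoding="utf-8")

# Fields are available by name...
decoded = result.text

# ...and, because the result is a tuple, by unpacking in positional order.
root_lexicon, text, encoding = result
```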

DEFAULT_ENCODING = 'utf-8'

The general default encoding, if a Language does not define another.

TEMP_TEXT_MAXSIZE = 5000

The maximum size of a text snippet that is searched for an encoding.

class DocumentIOMixin

Bases: object

Mixin class, adding load and save methods to Document.

It also expects WorkerDocumentMixin to be mixed in, because of the root lexicon handling.

classmethod load(url, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)

Load text from url and return a Document.

The current implementation only supports reading a file from the local file system.

The url is the filename. If the root_lexicon is None, no parsing will be done on the document. If True, guessing will be done using the specified registry or the default parce registry (in which case url and mimetype both can help in determining the language to use). If root_lexicon is a string name, the name will be looked up in the registry. Otherwise, it is assumed to be a Lexicon.

The url and the encoding are stored in the document’s attributes of the same name. The encoding is “utf-8” by default. The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.

The worker is a Worker or None. By default, a BackgroundWorker is used. The transformer is a Transformer or None. By default, no Transformer is installed. As a convenience, you can specify True, in which case a default Transformer is installed.

save(url=None, encoding=None, newline=None)

Save the document to a local file.

If you specify the url or encoding, the corresponding Document attributes are set as well. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if that is not found, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.

The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.
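The fallback order described for save() can be sketched as a small standalone function. This is a simplified reading of the description, not parce's code: resolve_encoding and its parameters are hypothetical names, and slicing the text to the first TEMP_TEXT_MAXSIZE characters is an assumption about how that limit is applied.

```python
TEMP_TEXT_MAXSIZE = 5000      # documented module constant
DEFAULT_ENCODING = "utf-8"    # documented module constant


def resolve_encoding(specified, document_encoding, text, find_encoding,
                     language_default=None):
    """Pick an encoding following the documented fallback order (sketch)."""
    if specified:
        return specified                  # 1. explicit argument
    if document_encoding:
        return document_encoding          # 2. the document's encoding attribute
    found = find_encoding(text[:TEMP_TEXT_MAXSIZE])
    if found:
        return found                      # 3. declared inside the text
    # 4. the language's default, else the general default
    return language_default or DEFAULT_ENCODING
```

The find_encoding argument stands for any callable that returns an encoding name or None, such as the find_encoding() method of an IO handler.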

classmethod from_bytes(data, url=None, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)

Load text from bytes or bytearray data and return a Document.

For all the other arguments, see load().

to_bytes(encoding=None, newline=None)

Return the binary encoded contents of the document.

The default implementation uses the encode_text() function. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if that is not found, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.

The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.

class IO

Bases: object

Functional base class for language-specific I/O handling.

You may create a “sister-class” in the same module as a Language, with “IO” appended to the class name, to have your IO-subclass automatically found.

So, if your language is named “MyLang”, a class “MyLangIO” in the same module that inherits from this class will be used for encoding handling.
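The naming convention can be illustrated with a self-contained sketch. The classes below are stand-ins, not parce's own, and find_io_class is a hypothetical helper showing one way such a lookup could work: walk the language's superclasses and look in each class's module for a name with “IO” appended.

```python
import sys


class IO:
    """Stand-in base class for this sketch."""
    def default_encoding(self):
        return "utf-8"


class Language:
    """Stand-in for a language base class."""


class MyLang(Language):
    pass


class MyLangIO(IO):
    """Found by name: "MyLang" + "IO", defined in the same module."""
    def default_encoding(self):
        return "latin1"


class SubLang(MyLang):
    """Has no sister class of its own; inherits the one of MyLang."""


class Other(Language):
    """Has no sister class at all."""


def find_io_class(language, base=IO):
    """Return the IO sister class for language, or base if there is none."""
    for cls in language.__mro__:
        module = sys.modules.get(cls.__module__)
        candidate = getattr(module, cls.__name__ + "IO", None)
        if isinstance(candidate, type) and issubclass(candidate, base):
            return candidate
    return base
```

With these definitions, MyLang and SubLang both resolve to MyLangIO, while Other falls back to the base IO class.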

classmethod get(lexicon)

Get an IO handler for this lexicon’s language.

If the lexicon is None, a new instance of the called IO is returned.

default_encoding()

Return the default encoding to use.

find_encoding(text)

Return an encoding stored inside the piece of text.

The default implementation recognizes some encoding="xxx" and (en)coding: xxxx variants. Returns None if no encoding is found.
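To give an idea of the kind of declarations meant here, the following sketch matches the two documented forms with a regular expression. The pattern and the name find_encoding_sketch are illustrative assumptions; the expression parce actually uses may differ.

```python
import re

# Hypothetical pattern: either encoding="xxx" (quotes optional) or
# coding: xxx / encoding: xxx.
_ENCODING_RE = re.compile(
    r"""\bencoding\s*=\s*["']?([\w.-]+)|\b(?:en)?coding\s*:\s*([\w.-]+)""",
    re.IGNORECASE,
)


def find_encoding_sketch(text):
    """Return the first encoding name declared in text, or None."""
    m = _ENCODING_RE.search(text)
    if m is None:
        return None
    # Exactly one of the two groups is set, depending on the matched form.
    return m.group(1) or m.group(2)
```

The sketch returns the name as written and does not check that Python knows a codec by that name.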

localfile(url)

Return the local filename the url points to.

The url is parsed using urlparse(). If the url has a file: scheme, the path is returned. If the url has no scheme and no netloc, the full url is returned so that it is used as a local file.

Currently raises a ValueError if the URL has a netloc or a scheme other than file:.
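The rules above translate fairly directly into code using urllib.parse.urlparse. The function below is a sketch under that reading, with a hypothetical name and error message; it is not the method's actual source.

```python
from urllib.parse import urlparse


def localfile_sketch(url):
    """Return the local filename url points to, per the documented rules."""
    parts = urlparse(url)
    if parts.netloc or parts.scheme not in ("", "file"):
        # Remote locations and other schemes are rejected.
        raise ValueError("unsupported url: {!r}".format(url))
    if parts.scheme == "file":
        return parts.path      # file:///tmp/a.txt gives /tmp/a.txt
    return url                 # no scheme, no netloc: use the url as the filename
```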

decode_data(data, root_lexicon=None, encoding=None, errors=None, newline=None, registry=None, url=None, mimetype=None)

Decode text from the binary (bytes or bytearray) data.

Returns a named tuple DecodeResult (root_lexicon, text, encoding).

If the data starts with a byte-order mark (BOM), the encoding that is specified by that BOM is used to read the rest of the data. Otherwise, the data is first interpreted as latin1 and examined. If no encoding can be determined by looking at the text, the specified encoding is used, or UTF-8 by default.

The root_lexicon determines how the data is further interpreted: If None, no parsing is done at all. If True, the specified registry or the default parce registry is used to guess the language (in this case url and mimetype both help in determining the language to use). If root_lexicon is a string name, it is looked up in the registry. Otherwise it is assumed to be a Lexicon.

When the root lexicon’s Language (or one of its superclasses) has an IO “sister-class” (i.e. a class in the same module with the same name and “IO” appended), that IO class’s find_encoding() method is called to determine the encoding of the text, which may be declared in the text in a language-specific way. If that method returns None, default_encoding() is called, which returns “utf-8” by default.

For example, for XML the encoding attribute of the first processing instruction is consulted; for HTML, the value of a <meta> tag with charset or http-equiv attributes; and so on.

The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.

If no encoding was specified, the returned encoding is the encoding that was finally used to read the text; otherwise it is the specified encoding.
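The byte-level part of this procedure (BOM check, then a latin1 pass to look for a declaration, then the fallbacks) can be sketched with the standard codecs module. This is a simplified illustration, not the method itself: decode_sketch is a hypothetical name, it leaves out root lexicon, errors and newline handling, it always returns the encoding that was actually used, and the 5000-byte snippet size is borrowed from TEMP_TEXT_MAXSIZE as an assumption.

```python
import codecs

# UTF-32 BOMs are tested before UTF-16 ones, because the UTF-32-LE BOM
# starts with the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_sketch(data, encoding=None, find_encoding=None, default="utf-8"):
    """Return (text, encoding_used) for bytes or bytearray data."""
    data = bytes(data)
    for bom, name in _BOMS:
        if data.startswith(bom):
            # The BOM decides the encoding; strip it before decoding.
            return data[len(bom):].decode(name), name
    found = None
    if find_encoding is not None:
        # latin1 maps every byte to a character, so this never fails.
        found = find_encoding(data[:5000].decode("latin1"))
    used = found or encoding or default
    return data.decode(used), used
```

Decoding as latin1 first is safe because every byte value is a valid latin1 character, so ASCII-compatible declarations stay readable whatever the real encoding is.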

encode_text(text, root_lexicon=None, encoding=None, newline=None)

Return a bytes object with the encoded text.

If encoding is None, the root_lexicon is used to help find an encoding set in the document. The newline argument is passed to the underlying io.TextIOWrapper writing the file contents.
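How the newline argument affects the resulting bytes can be seen by writing through an io.TextIOWrapper into an in-memory buffer. The function below is only a sketch with a hypothetical name; it assumes the encoding has already been decided and does not consult a root lexicon.

```python
import io


def encode_sketch(text, encoding="utf-8", newline=None):
    """Encode text to bytes, letting TextIOWrapper translate newlines."""
    buffer = io.BytesIO()
    with io.TextIOWrapper(buffer, encoding=encoding, newline=newline) as f:
        f.write(text)
        f.flush()
        # Read the bytes before the wrapper closes the underlying buffer.
        return buffer.getvalue()
```

With newline="\r\n", every "\n" in the text is written as "\r\n"; with newline="", line endings are written unchanged. With the default newline=None, "\n" is translated to the platform's line separator, so the result depends on the operating system.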