The docio module#
I/O handling for Documents.
This module defines DocumentIOMixin to mix in with the other parce
Document base classes, adding load and save methods. These methods are not
mandatory at all; you can choose to implement your own save and load logic.
When a Document is loaded or saved, the filename is stored in the Document’s url attribute, and if you specify an encoding, it is also stored in the Document’s encoding attribute.
Besides that, this module enables intelligent encoding determination and
handling, where the language of a Document can point to an IO subclass
which implements encoding determination based on the document’s language. An
IO “sister-class” of a Language can define a
default encoding and provides a method to consult the document’s contents to
see if an encoding is defined there, and use that for I/O operations.
- class DecodeResult(root_lexicon, text, encoding)#
Bases: tuple
The result of the DocumentIOMixin.decode_data() method.
- encoding#
The encoding that was specified or determined, or None.
- root_lexicon#
The root lexicon or None.
- text#
The decoded text.
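As a rough illustration of how such a named tuple can be consumed, the sketch below builds a stand-in with `collections.namedtuple`. It does not import parce; the stand-in class and the sample values are assumptions made for the example.

```python
from collections import namedtuple

# Stand-in with the field names documented above; this is not parce's own class.
DecodeResult = namedtuple("DecodeResult", "root_lexicon text encoding")

# Made-up sample values, only to show how the fields are accessed.
result = DecodeResult(root_lexicon=None, text="hello", encoding="utf-8")

# Fields can be read by name ...
print(result.text, result.encoding)

# ... or unpacked by position, in the documented order.
root_lexicon, text, encoding = result
```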
- DEFAULT_ENCODING = 'utf-8'#
The general default encoding, if a Language does not define another.
- TEMP_TEXT_MAXSIZE = 5000#
The maximum size of a text snippet that is searched for an encoding.
- class DocumentIOMixin[source]#
Bases: object
Mixin class, adding load and save methods to Document.
It also expects WorkerDocumentMixin to be mixed in, because of the root lexicon handling.
- classmethod load(url, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)[source]#
Load text from url and return a Document.
The current implementation only supports reading a file from the local file system.
The url is the filename. If the root_lexicon is None, no parsing will be done on the document. If True, guessing will be done using the specified registry or the default parce registry (in which case url and mimetype can both help in determining the language to use). If root_lexicon is a string name, the name will be looked up in the registry. Otherwise, it is assumed to be a Lexicon.
The url and the encoding are stored in the document’s attributes of the same name. The encoding is “utf-8” by default. The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.
The worker is a Worker or None. By default, a BackgroundWorker is used. The transformer is a Transformer or None. By default, no Transformer is installed. As a convenience, you can specify True, in which case a default Transformer is installed.
- save(url=None, encoding=None, newline=None)[source]#
Save the document to a local file.
If you specify the url or encoding, the corresponding Document attributes are set as well. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if that is not found, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.
The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.
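The fallback order described above can be sketched as a small helper. This is a simplified illustration, not parce's implementation: the function name `resolve_encoding` and its parameters are hypothetical, and the search of the document text is assumed to have happened already.

```python
DEFAULT_ENCODING = "utf-8"  # the general default named in this module


def resolve_encoding(specified=None, doc_encoding=None,
                     found_in_text=None, language_default=None):
    """Return the first available encoding, in the order described above.

    All four parameters are hypothetical names for this sketch:
    - specified: the ``encoding`` argument given to save()
    - doc_encoding: the document's ``encoding`` attribute
    - found_in_text: an encoding declared inside the document's text
    - language_default: the default from the language's IO handler
    """
    for candidate in (specified, doc_encoding, found_in_text, language_default):
        if candidate:
            return candidate
    return DEFAULT_ENCODING
```

For example, `resolve_encoding(found_in_text="latin1", language_default="utf-16")` yields `"latin1"`, because an encoding declared in the text is consulted before the language's default.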
- classmethod from_bytes(data, url=None, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)[source]#
Load text from bytes or bytearray data and return a Document.
For all the other arguments, see load().
- to_bytes(encoding=None, newline=None)[source]#
Return the binary encoded contents of the document.
The default implementation uses the encode_text() function. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if that is not found, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.
The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.
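To show what passing newline to an io.TextIOWrapper means for the resulting bytes, here is a minimal standard-library sketch. The helper name `to_bytes_sketch` is hypothetical, and the encoding lookup described above is left out; only the encoding and newline translation step is illustrated.

```python
import io


def to_bytes_sketch(text, encoding="utf-8", newline=None):
    """Encode text to bytes via io.TextIOWrapper (illustrative helper only)."""
    buffer = io.BytesIO()
    wrapper = io.TextIOWrapper(buffer, encoding=encoding, newline=newline)
    wrapper.write(text)
    wrapper.flush()           # make sure everything reaches the BytesIO
    data = buffer.getvalue()
    wrapper.detach()          # keep the wrapper from closing the buffer later
    return data


# With newline="\r\n", every "\n" written is translated to "\r\n".
crlf = to_bytes_sketch("a\nb", newline="\r\n")

# With newline="", no translation takes place.
raw = to_bytes_sketch("a\nb", encoding="latin1", newline="")
```

When newline is None, the wrapper translates "\n" to the platform's line separator, so the output differs between operating systems.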
- class IO[source]#
Bases: object
Functional base class for language-specific I/O handling.
You may create a “sister-class” in the same module as a Language, with “IO” appended to the class name, to have your IO subclass found automatically.
So, if your language is named “MyLang”, a class “MyLangIO” in the same module that inherits from this class will be used for encoding handling.
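The naming convention can be illustrated without parce. In the sketch below, `Language` and `IO` are hypothetical stand-ins for the real base classes, `find_io_class` is an invented helper, and a plain dict plays the role of the module namespace; the actual lookup in parce may differ in detail.

```python
class Language:
    """Hypothetical stand-in for a Language base class."""


class IO:
    """Hypothetical stand-in for the IO base class described above."""

    def default_encoding(self):
        return "utf-8"


class MyLang(Language):
    pass


class MyLangIO(IO):
    """Sister-class of MyLang: same name with "IO" appended."""

    def default_encoding(self):
        return "latin1"


class MyDialect(MyLang):
    """Has no sister-class of its own; MyLangIO is found via the superclass."""


def find_io_class(language, namespace):
    """Return the first "<Name>IO" class for language or one of its superclasses.

    namespace stands in for the module in which the language classes live.
    """
    for cls in language.__mro__:
        candidate = namespace.get(cls.__name__ + "IO")
        if isinstance(candidate, type) and issubclass(candidate, IO):
            return candidate
    return IO  # fall back to the generic base class


namespace = {"MyLangIO": MyLangIO}
io_class = find_io_class(MyDialect, namespace)
```

Here `io_class` is `MyLangIO`, because the lookup walks from `MyDialect` up to `MyLang` before finding a matching name.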
- localfile(url)[source]#
Return the local filename the url points to.
The url is parsed using urlparse(). If the url has a file: scheme, the path is returned. If the url has no scheme and no netloc, the full url is returned so that it is used as a local file.
Currently raises a ValueError if the URL has a netloc or a scheme other than file:.
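The rules above can be approximated with the standard library's `urllib.parse.urlparse`. This is a sketch written from the description, not a copy of parce's code; the name `localfile_sketch` and the error message are assumptions.

```python
from urllib.parse import urlparse


def localfile_sketch(url):
    """Return a local filename for url, following the rules described above."""
    parts = urlparse(url)
    if parts.netloc or parts.scheme not in ("", "file"):
        # Remote hosts and non-file schemes are not supported.
        raise ValueError("not a local file: {!r}".format(url))
    if parts.scheme == "file":
        return parts.path   # file:///tmp/a.txt -> /tmp/a.txt
    return url              # no scheme, no netloc: use the url as the filename


path_from_file_url = localfile_sketch("file:///tmp/example.txt")
plain_path = localfile_sketch("notes/example.txt")
```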
- decode_data(data, root_lexicon=None, encoding=None, errors=None, newline=None, registry=None, url=None, mimetype=None)[source]#
Decode text from the binary (bytes or bytearray) data.
Returns a named tuple DecodeResult(root_lexicon, text, encoding).
If the data starts with a byte-order mark (BOM), the encoding that is specified by that BOM is used to read the rest of the data. Otherwise, the data is first interpreted as latin1 and examined. If no encoding can be determined by looking at the text, the specified encoding is used, or UTF-8 by default.
The root_lexicon determines how the data is further interpreted: If None, no parsing is done at all. If True, the specified registry or the default parce registry is used to guess the language (in this case url and mimetype both help in determining the language to use). If root_lexicon is a string name, it is looked up in the registry. Otherwise it is assumed to be a Lexicon.
When the root lexicon’s Language (or one of its superclasses) has an IO “sister-class” (i.e. a class in the same module with the same name with “IO” appended), that IO class’s find_encoding() method is called to determine the encoding of the text, which may be mentioned in the text in a way specific to that language. If that method returns None, default_encoding() is called, which returns “utf-8” by default.
E.g. for XML, the encoding attribute of the first processing instruction is consulted, for Html the value of a <meta> tag with charset or http-equiv attributes, etc.
The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.
If no encoding was specified, the returned encoding is the encoding that was finally used to read the text; otherwise it is the specified encoding.
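The BOM check and the latin1 inspection step can be sketched with the standard library alone. Everything named here (`decode_sketch`, `xml_find_encoding`, the `find_encoding` callback parameter) is hypothetical, language guessing and the errors and newline arguments are left out, and the sketch always returns the encoding it actually used, which simplifies the return rule described above.

```python
import codecs
import re

DEFAULT_ENCODING = "utf-8"
TEMP_TEXT_MAXSIZE = 5000

# UTF-32 comes first: its little-endian BOM starts with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def xml_find_encoding(snippet):
    """Hypothetical callback: read the encoding from an XML declaration."""
    match = re.search(r'<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', snippet)
    return match.group(1) if match else None


def decode_sketch(data, encoding=None, find_encoding=None):
    """Return (text, encoding_used) for bytes or bytearray data."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # latin1 maps every byte to a character, so this never fails.
    snippet = data[:TEMP_TEXT_MAXSIZE].decode("latin1")
    found = find_encoding(snippet) if find_encoding else None
    used = found or encoding or DEFAULT_ENCODING
    return data.decode(used), used


xml_bytes = '<?xml version="1.0" encoding="iso-8859-1"?><a>\u00e9</a>'.encode("iso-8859-1")
text, used = decode_sketch(xml_bytes, find_encoding=xml_find_encoding)
```

Decoding the first bytes as latin1 is safe for inspection only; the text is decoded again with the encoding that was found.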
- encode_text(text, root_lexicon=None, encoding=None, newline=None)[source]#
Return a bytes object with the encoded text.
If encoding is None, the root_lexicon is used to help find an encoding set in the document. The newline argument is passed to the underlying io.TextIOWrapper writing the file contents.