The docio module
I/O handling for Documents.
This module defines DocumentIOMixin to mix in with the other parce Document base classes, adding load and save methods. These methods are not mandatory; you can implement your own save and load logic instead.
When a Document is loaded or saved, the filename is stored in the Document’s url attribute, and if you specify an encoding, it is also stored in the Document’s encoding attribute.
Besides that, this module enables intelligent encoding determination and handling: the language of a Document can point to an IO subclass that implements encoding determination based on the document’s language. An IO “sister-class” of a Language can define a default encoding and provides a method to consult the document’s contents to see if an encoding is defined there, and use that for I/O operations.
-
class DecodeResult(root_lexicon, text, encoding)
Bases: tuple
The result of the DocumentIOMixin.decode_data() method.
-
encoding
The encoding that was specified or determined, or None.
-
root_lexicon
The root lexicon or None.
-
text
The decoded text.
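Because DecodeResult is a named tuple, its fields can be read by name or unpacked by position. A minimal sketch, using a stand-in built with collections.namedtuple; the stand-in and the sample values are assumptions for illustration, not parce's own definition:

```python
from collections import namedtuple

# Stand-in with the documented field order (hypothetical, not imported
# from parce).
DecodeResult = namedtuple("DecodeResult", "root_lexicon text encoding")

result = DecodeResult(root_lexicon=None, text="hello", encoding="utf-8")

# Access a field by name ...
print(result.text, result.encoding)

# ... or unpack all fields by position, in the documented order.
root_lexicon, text, encoding = result
```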
-
-
DEFAULT_ENCODING = 'utf-8'
The general default encoding, if a Language does not define another.
-
TEMP_TEXT_MAXSIZE = 5000
The maximum size of a text snippet that is searched for an encoding.
-
class DocumentIOMixin
Bases: object
Mixin class, adding load and save methods to Document.
It also expects WorkerDocumentMixin to be mixed in, because of the root lexicon handling.
-
classmethod load(url, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)
Load text from url and return a Document.
The current implementation only supports reading a file from the local file system.
The url is the filename. If the root_lexicon is None, no parsing will be done on the document. If True, the language is guessed using the specified registry or the default parce registry (in which case url and mimetype both can help in determining the language to use). If root_lexicon is a string name, the name will be looked up in the registry. Otherwise, it is assumed to be a Lexicon.
The url and the encoding are stored in the document’s attributes of the same name. The encoding is “utf-8” by default. The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.
The worker is a Worker or None. By default, a BackgroundWorker is used. The transformer is a Transformer or None. By default, no Transformer is installed. As a convenience, you can specify True, in which case a default Transformer is installed.
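The four cases for root_lexicon described above can be sketched as a small dispatch function. This is only an illustration of the rules, not parce's code: the registry is a plain dict, guess is a hypothetical callable standing in for the registry's guessing, and the lexicon values are placeholder strings.

```python
def resolve_root_lexicon(root_lexicon, registry, guess):
    """Return the lexicon to use, following the rules described above.

    ``registry`` (a dict of name -> lexicon) and ``guess`` (a callable
    returning a guessed lexicon) are hypothetical stand-ins.
    """
    if root_lexicon is None:
        return None                      # no parsing at all
    if root_lexicon is True:
        return guess()                   # guess from url, mimetype, contents
    if isinstance(root_lexicon, str):
        return registry[root_lexicon]    # look the name up in the registry
    return root_lexicon                  # assumed to be a lexicon already


# Placeholder registry and guess function, for illustration only.
registry = {"MyLang.root": "<MyLang lexicon>"}
guess = lambda: "<guessed lexicon>"

print(resolve_root_lexicon("MyLang.root", registry, guess))
```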
-
save(url=None, encoding=None, newline=None)
Save the document to a local file.
If you specify the url or encoding, the corresponding Document attributes are set as well. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if it is not found there, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.
The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.
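The order in which an encoding is chosen when saving can be read as a simple fallback chain. The sketch below only illustrates that order; the function name and its parameters are hypothetical, and each argument stands for a value that the real code would obtain elsewhere (the call argument, the document attribute, a search of the text, and the language's IO handler).

```python
DEFAULT_ENCODING = "utf-8"


def choose_encoding(explicit=None, doc_encoding=None, found_in_text=None,
                    language_default=None):
    """Return the first encoding that is set, in the documented order."""
    for candidate in (explicit, doc_encoding, found_in_text, language_default):
        if candidate:
            return candidate
    # Nothing was specified or found: fall back to the general default.
    return DEFAULT_ENCODING


# The document attribute wins over an encoding mentioned in the text.
print(choose_encoding(doc_encoding="latin1", found_in_text="utf-16"))
```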
-
classmethod from_bytes(data, url=None, root_lexicon=True, encoding=None, errors=None, newline=None, registry=None, mimetype=None, worker=None, transformer=None)
Load text from bytes or bytearray data and return a Document.
For all the other arguments, see load().
-
to_bytes(encoding=None, newline=None)
Return the binary encoded contents of the document.
The default implementation uses the encode_text() function. If the encoding is not specified and also not set in the corresponding document attribute, the encoding to use is searched for in the document’s text; if it is not found there, the language’s IO handler can define the default encoding to use; the ultimate default is “utf-8”.
The newline argument will be passed to the underlying io.TextIOWrapper that writes the document’s contents.
-
class IO
Bases: object
Functional base class for language-specific I/O handling.
You may create a “sister-class” in the same module as a Language, with “IO” appended to the class name, to have your IO subclass found automatically.
So, if your language has the name “MyLang”, a class “MyLangIO” in the same module that inherits from this class will be used for encoding handling.
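The naming convention can be illustrated with a small lookup that walks a language's base classes and looks for a class with “IO” appended to the name. This is a sketch under stated assumptions, not parce's implementation: IO and Language are local stand-ins, find_io_class is a hypothetical helper, and for brevity the names are looked up in one namespace dict instead of in each class's own module.

```python
class IO:
    """Stand-in for the IO base class (not imported from parce)."""


class Language:
    """Stand-in for the Language base class."""


class MyLang(Language):
    pass


class MyLangIO(IO):
    """Sister class: same name as the language, with "IO" appended."""


class MyDialect(MyLang):
    """Has no IO class of its own, so MyLangIO should be found."""


# Hypothetical stand-in for the contents of the language's module.
namespace = {"MyLang": MyLang, "MyLangIO": MyLangIO, "MyDialect": MyDialect}


def find_io_class(language, namespace):
    """Return the first "<Name>IO" class for language or one of its bases."""
    for cls in language.__mro__:
        candidate = namespace.get(cls.__name__ + "IO")
        if isinstance(candidate, type) and issubclass(candidate, IO):
            return candidate
    return None  # assumption: the caller falls back to general defaults


print(find_io_class(MyDialect, namespace).__name__)
```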
-
localfile(url)
Return the local filename the url points to.
The url is parsed using urlparse(). If the url has a file: scheme, the path is returned. If the url has no scheme and no netloc, the full url is returned so that it is used as a local file.
Currently raises a ValueError if the URL has a netloc or a scheme other than file:.
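The rules above map directly onto the result of urllib.parse.urlparse(). The following is a minimal re-implementation sketch of the described behaviour, not the module's actual code; the error message is an assumption, and details such as percent-decoding of the path are not covered.

```python
from urllib.parse import urlparse


def localfile(url):
    """Return the local filename for url, or raise ValueError."""
    parts = urlparse(url)
    if parts.scheme == "file" and not parts.netloc:
        return parts.path            # file: URL, use its path
    if not parts.scheme and not parts.netloc:
        return url                   # plain filename, use it unchanged
    raise ValueError("only local files are supported: {!r}".format(url))


print(localfile("file:///home/user/doc.xml"))
print(localfile("notes/doc.xml"))
```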
-
decode_data(data, root_lexicon=None, encoding=None, errors=None, newline=None, registry=None, url=None, mimetype=None)
Decode text from the binary (bytes or bytearray) data.
Returns a named tuple DecodeResult(root_lexicon, text, encoding).
If the data starts with a byte-order mark (BOM), the encoding that is specified by that BOM is used to read the rest of the data. Otherwise, the data is first interpreted as latin1 and examined. If no encoding can be determined by looking at the text, the specified encoding is used, or UTF-8 by default.
The root_lexicon determines how the data is further interpreted: If None, no parsing is done at all. If True, the specified registry or the default parce registry is used to guess the language (in this case url and mimetype both help in determining the language to use). If root_lexicon is a string name, it is looked up in the registry. Otherwise it is assumed to be a Lexicon.
When the root lexicon’s Language (or one of its superclasses) has an IO “sister-class” (i.e. a class in the same module with the same name with “IO” appended), that IO class’s get_encoding() method is called to determine the encoding of the text, which may be mentioned in the text in a way specific to that language. If that method returns None, default_encoding() is called, which by default also returns “utf-8”.
E.g. for XML, the encoding attribute of the first processing instruction is consulted, for Html the value of a <meta> tag with charset or http-equiv attributes, etc.
The errors and newline arguments will be passed to the underlying io.TextIOWrapper reading the file contents.
If no encoding was specified, the returned encoding is the encoding that was finally used to read the text; otherwise it is the specified encoding.
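The BOM check and the latin1 "peek" described above can be sketched with the standard library alone. This is a simplified illustration, not the module's code: the sniff function, the BOMS table and the XML_DECL pattern are hypothetical, only an XML-style declaration is searched for, the errors and newline arguments are left out, and the function always reports the encoding it actually used.

```python
import codecs
import re

TEMP_TEXT_MAXSIZE = 5000

# The UTF-32 marks are tested before UTF-16, because BOM_UTF32_LE starts
# with the same two bytes as BOM_UTF16_LE.
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

# Assumed pattern for an XML declaration such as <?xml ... encoding="..."?>
XML_DECL = re.compile(r'<\?xml[^>]*?encoding\s*=\s*["\']([\w.-]+)["\']')


def sniff(data, encoding=None):
    """Return (text, encoding_used) for the bytes in data."""
    for bom, bom_encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM and decode the rest with the encoding it names.
            return data[len(bom):].decode(bom_encoding), bom_encoding
    # latin1 maps every byte to a character, so this decode cannot fail.
    snippet = data[:TEMP_TEXT_MAXSIZE].decode("latin1")
    match = XML_DECL.search(snippet)
    found = match.group(1) if match else None
    # Encoding found in the text first, then the specified one, then UTF-8.
    used = found or encoding or "utf-8"
    return data.decode(used), used


text, used = sniff(b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>')
print(used, text)
```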
-
encode_text(text, root_lexicon=None, encoding=None, newline=None)
Return a bytes object with the encoded text.
If encoding is None, the root_lexicon is used to help find an encoding set in the document. The newline argument is passed to the underlying io.TextIOWrapper writing the file contents.
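How the newline argument affects the resulting bytes can be seen by writing through an io.TextIOWrapper into an in-memory buffer. This is a minimal sketch of that one step, assuming the encoding is already known; the function name encode is hypothetical and the root_lexicon lookup is left out.

```python
import io


def encode(text, encoding="utf-8", newline=None):
    """Return text encoded to bytes, translating newlines on the way."""
    buffer = io.BytesIO()
    with io.TextIOWrapper(buffer, encoding=encoding, newline=newline) as stream:
        stream.write(text)
        stream.flush()               # make sure everything reached the buffer
        return buffer.getvalue()     # read before the wrapper closes the buffer


# With newline="\r\n", every "\n" in the text is written as "\r\n".
print(encode("a\nb\n", newline="\r\n"))

# With newline="" or "\n", no translation takes place.
print(encode("a\nb\n", newline="\n"))
```

Note that with newline=None, io.TextIOWrapper translates "\n" to the platform's default line separator, so the output then depends on the operating system.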