Adjust raw text extraction from docutils documents (#33)

The previous version of this code relied on the Text.rawsource attribute
to obtain the raw, original version of the translated texts contained in
.po files. This attribute however was removed in docutils 0.18, and thus
a different way of obtaining this information was needed.

(Note that this removal was planned, but not yet for this release: it is
listed under "Future changes" rather than in 0.18's list of changes.
https://sourceforge.net/p/docutils/bugs/437/ has been opened to get this
clarified.)

The commit that removed Text.rawsource mentioned that the data fed
into the Text elements was already the raw source, hence there was no
need to keep a separate attribute. Text objects derive from str, so we
can directly add them to the list of strings where NodeToTextVisitor
builds the original text, with the caveat that backslashes need to be
restored first (they are apparently encoded as null bytes after
parsing).
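
A minimal sketch of the backslash restoration, assuming the null-byte
encoding described above. It uses only the standard library: FakeText and
restore_backslashes are hypothetical stand-ins for docutils' Text class and
for docutils.nodes.unescape(..., restore_backslashes=True), not the
library's own code.

```python
class FakeText(str):
    """Hypothetical stand-in for docutils.nodes.Text, which derives from str."""


def restore_backslashes(text):
    """Turn the null bytes left by the parser back into backslashes.

    Assumption: this mirrors what docutils.nodes.unescape() does when
    called with restore_backslashes=True.
    """
    return text.replace("\x00", "\\")


# Text as it might look after parsing: the backslash in "\*" is a null byte.
parsed = FakeText("Use \x00* for a literal asterisk")

output = []
# A str subclass can be appended to a list of strings like any other str.
output.append(restore_backslashes(parsed))
original = "".join(output)
```

Joining the list afterwards yields the text with its escape sequences
intact, which is the form that should be compared against the .po entry.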

The other side-effect of using the Text objects directly instead of the
Text.rawsource attribute is that now we get more of them. The document
resulting from docutils' parsing can contain system_message elements
with debugging information from the parsing process, such as warnings.
These are Text elements with no rawsource, but with actual text, so we
need to skip them. In the same spirit, citation_references and
substitution_references need to be ignored as well.
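
The skipping can be pictured with the following sketch. It is deliberately
independent of docutils: Node, IGNORED and collect_text are hypothetical
names for illustration, whereas the real visitor is built on
docutils.nodes.NodeVisitor.

```python
class Node:
    """Hypothetical, simplified node: a tag name, optional text, children."""

    def __init__(self, tag, text="", children=()):
        self.tag = tag
        self.text = text
        self.children = list(children)


# Element types whose text should not reach the spell checker.
IGNORED = {"system_message", "citation_reference", "substitution_reference"}


def collect_text(node, output):
    """Depth-first walk that gathers text and skips ignored subtrees."""
    if node.tag in IGNORED:
        return  # drop the element together with everything below it
    if node.tag == "Text":
        output.append(node.text)
    for child in node.children:
        collect_text(child, output)


document = Node("paragraph", children=[
    Node("Text", "Translated sentence."),
    Node("system_message", children=[
        Node("paragraph", children=[Node("Text", "Unknown role.")]),
    ]),
    Node("substitution_reference", children=[Node("Text", "version")]),
])

collected = []
collect_text(document, collected)
```

Only the translated sentence ends up in collected; the parser warning and
the substitution name are dropped along with their subtrees.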

All these changes allow pospell to work against the latest docutils. On
the other hand, the lowest supported version is 0.16: 0.11 through 0.14
failed at rfc role parsing (used for example in the python docs), and
0.15 didn't have a method to restore backslashes (which again made the
python docs fail).

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
rtobar 2021-12-01 00:57:04 +08:00 committed by GitHub
parent 2844284bb7
commit c4feb4d25f
2 changed files with 5 additions and 2 deletions


@@ -90,11 +90,14 @@ class NodeToTextVisitor(docutils.nodes.NodeVisitor):
         "emphasis",
         "superscript",
         "title_reference",
+        "substitution_reference",
+        "citation_reference",
         "strong",
         "DummyNodeClass",
         "reference",
         "literal",
         "Text",
+        "system_message",
     )

     def __init__(self, document):
@@ -123,7 +126,7 @@ class NodeToTextVisitor(docutils.nodes.NodeVisitor):
     def visit_Text(self, node):
         """Keep this node text, this is typically what we want to spell check."""
-        self.output.append(node.rawsource)
+        self.output.append(docutils.nodes.unescape(node, restore_backslashes=True))

     def __str__(self):
         """Give the accumulated strings."""


@@ -26,7 +26,7 @@ classifiers =
 [options]
 py_modules = pospell
 python_requires = >= 3.6
-install_requires = polib; docutils>=0.11,<0.18; regex
+install_requires = polib; docutils>=0.16; regex

 [options.entry_points]
 console_scripts = pospell=pospell:main