Adjust raw text extraction from docutils documents (#33)

The previous version of this code relied on the Text.rawsource attribute to obtain the raw, original version of the translated texts contained in .po files. However, this attribute was removed in docutils 0.18, so a different way of obtaining this information was needed. (Note that this removal was planned, but not for this release: it is currently listed not in 0.18's list of changes, but under "Future changes". https://sourceforge.net/p/docutils/bugs/437/ has been opened to get this eventually clarified.)

The commit that removed Text.rawsource mentioned that the data fed into the Text elements was already the raw source, hence there was no need to keep a separate attribute. Text objects derive from str, so we can directly add them to the list of strings where NodeToTextVisitor builds the original text, with the caveat that backslashes need to be restored (they are encoded as null bytes after parsing, apparently).

The other side effect of using the Text objects directly instead of the Text.rawsource attribute is that we now get more of them. The document resulting from docutils' parsing can contain system_message elements with debugging information from the parsing process, such as warnings. These are Text elements with no rawsource, but with actual text, so we need to skip them. In the same spirit, citation_references and substitution_references need to be ignored as well.

All these changes allow pospell to work against the latest docutils. On the other hand, the lowest supported version is now 0.16: 0.11 through 0.14 failed at rfc role parsing (used for example in the python docs), and 0.15 didn't have a method to restore backslashes (which again made the python docs fail).

Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
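As a rough illustration of the backslash caveat described above, the sketch below shows the kind of substitution that restoring backslashes amounts to. It assumes, as the message does, that escaping backslashes appear as null characters in the parsed text. The helper name `restore_backslashes_sketch` and the sample string are hypothetical stand-ins so the example runs without docutils; the actual change calls `docutils.nodes.unescape(node, restore_backslashes=True)`.

```python
# Assumption: after parsing, each escaping backslash is stored as a null
# character ("\x00") inside the text of a docutils Text node.
NULL_ESCAPE = "\x00"


def restore_backslashes_sketch(text: str) -> str:
    """Hypothetical stand-in: turn null-encoded escapes back into backslashes."""
    return text.replace(NULL_ESCAPE, "\\")


# Made-up parsed text in which "\*" was originally written in the source.
parsed = "a literal \x00* asterisk"
print(restore_backslashes_sketch(parsed))
```

Text without any null characters passes through unchanged, so applying the restoration to every Text node should be harmless under this assumption.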
2021-12-01 00:57:04 +08:00 · 2021-12-01 00:57:04 +08:00 · c4feb4d25f
parent 2844284bb7
commit c4feb4d25f
2 changed files with 5 additions and 2 deletions
--- a/pospell.py
+++ b/pospell.py
@@ -90,11 +90,14 @@ class NodeToTextVisitor(docutils.nodes.NodeVisitor):
        "emphasis",
        "superscript",
        "title_reference",
+        "substitution_reference",
+        "citation_reference",
        "strong",
        "DummyNodeClass",
        "reference",
        "literal",
        "Text",
+        "system_message",
    )

    def __init__(self, document):
@@ -123,7 +126,7 @@ class NodeToTextVisitor(docutils.nodes.NodeVisitor):

    def visit_Text(self, node):
        """Keep this node text, this is typically what we want to spell check."""
-        self.output.append(node.rawsource)
+        self.output.append(docutils.nodes.unescape(node, restore_backslashes=True))

    def __str__(self):
        """Give the accumulated strings."""
--- a/setup.cfg
+++ b/setup.cfg
@@ -26,7 +26,7 @@ classifiers =
 [options]
 py_modules = pospell
 python_requires = >= 3.6
-install_requires = polib; docutils>=0.11,<0.18; regex
+install_requires = polib; docutils>=0.16; regex

 [options.entry_points]
 console_scripts = pospell=pospell:main