The previous version of this code relied on the Text.rawsource attribute
to obtain the raw, original version of the translated texts contained in
.po files. This attribute, however, was removed in docutils 0.18, so a
different way of obtaining this information was needed.
(Note that this removal was planned, but not for this release yet: it is
currently listed not under 0.18's changes, but under "Future changes".
https://sourceforge.net/p/docutils/bugs/437/ has been opened to get this
eventually clarified.)
The commit that removed Text.rawsource mentioned that the data fed into
the Text elements was already the raw source, hence there was no need to
keep a separate attribute. Text objects derive from str, so we can add
them directly to the list of strings from which NodeToTextVisitor builds
the original text, with the caveat that their backslashes need to be
restored (they are apparently encoded as null bytes during parsing).
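The restoration step can be sketched as follows. This is a minimal,
illustrative helper, assuming the null-byte encoding described above
(it mirrors what docutils' own unescaping does, not pospell's exact code):

```python
def restore_backslashes(text: str) -> str:
    """Recover raw text from a docutils-parsed string.

    During parsing, docutils replaces each escaping backslash with a
    null byte; putting the backslash back yields the original source.
    """
    return text.replace("\x00", "\\")
```

For example, a parsed `a\x00*b` becomes the raw `a\*b` again.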
The other side effect of using the Text objects directly instead of the
Text.rawsource attribute is that now we get more of them. The document
resulting from docutils' parsing can contain system_message elements
with debugging information from the parsing process, such as warnings.
These are Text elements with no rawsource but with actual text, so we
need to skip them. In the same spirit, citation_references and
substitution_references need to be ignored as well.
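The skip logic amounts to pruning whole subtrees while collecting text.
Here is a toy model of that idea: nodes are (tagname, payload) tuples,
where the payload is either text or a list of child nodes. The real
visitor walks docutils nodes, so all names here are illustrative:

```python
# Element types whose text must not reach the spell checker: parser
# debugging output and reference markup (names taken from docutils).
SKIPPED = {"system_message", "citation_reference", "substitution_reference"}

def collect_text(node, out):
    """Depth-first text collection that prunes skipped subtrees."""
    tag, payload = node
    if tag in SKIPPED:
        return  # drop the whole subtree, including its Text children
    if isinstance(payload, str):
        out.append(payload)  # a leaf Text node
    else:
        for child in payload:
            collect_text(child, out)
```

With this, a system_message's inner Text never ends up in the collected
strings, even though it carries actual text.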
All these changes allow pospell to work against the latest docutils. On
the other hand, the lowest supported version is now 0.16: 0.11 through
0.14 failed to parse the rfc role (used, for example, in the python
docs), and 0.15 lacked a method to restore backslashes (which again
made the python docs fail).
Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
At the moment pospell complains if invoked with a --glob pattern but
without any other po_files on the command line. This is a problem only
with the input check, as the rest of the code already handles the
situation. To work around it, one *needs* to pass a po_file on the
command line as well, even if the glob pattern already matches it.
This commit adjusts the condition that checks that input files have
been specified so that --glob also counts as a source of input files.
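The adjusted condition can be sketched like this; the option names match
the ones mentioned above, but the parser setup is an assumption, not
pospell's actual code:

```python
import argparse

parser = argparse.ArgumentParser(prog="pospell")
parser.add_argument("--glob", default="", help="glob pattern of .po files to check")
parser.add_argument("po_file", nargs="*", help=".po files to check")

def has_input(args):
    # Before the fix, only args.po_file was considered here; now a
    # --glob pattern is also accepted as a source of input files.
    return bool(args.po_file) or bool(args.glob)
```

With this check, `pospell --glob '*/*.po'` no longer requires a
redundant positional po_file.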
Signed-off-by: Rodrigo Tobar <rtobar@icrar.org>
One of the main drawbacks of pospell at the moment is that checking is
performed serially by a single hunspell process. In small projects this
is not noticeable, but in slightly bigger ones the checking time grows
considerably (e.g., in python-docs-es it takes ~2 minutes to check the
whole set of .po files).
The obvious solution to speed things up is to use multiprocessing,
parallelising the work at two different places: first, when reading the
input .po files and collecting the input strings to feed into hunspell,
and second, when running hunspell itself.
This commit implements this support. It works as follows:
* A new namedtuple called input_line has been added. It contains a
filename, a line number, and the text found there, and thus it uniquely
identifies an input line in a self-contained way.
* When collecting input to feed into hunspell, the po_to_text routine
collects input_lines instead of a simple string. This is done with a
multiprocessing Pool to run in parallel across all input files.
* The input_lines are split in N blocks, with N being the size of the
pool. Note that during this process input_lines from different files
might end up in the same block, and input_lines from the same file
might end up in different blocks; however since input_lines are
self-contained we are not losing information.
* N hunspell instances are run over the N blocks of input_lines using
the pool (only the text field from the input_lines is fed into
hunspell).
* When interpreting errors from hunspell we can match an input_line
with its corresponding hunspell output lines, and thus can identify
the original file:line that caused the error.
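The scheme above can be sketched as follows. This is a simplified model
under stated assumptions: the names mirror the description above rather
than pospell's exact code, a thread pool stands in for the process pool
for brevity, and check_block stands in for an actual hunspell run:

```python
from collections import namedtuple
from multiprocessing.dummy import Pool  # thread pool for illustration only;
                                        # the real code uses a process Pool

# Self-contained unit of work: uniquely identifies one input line.
input_line = namedtuple("input_line", ["filename", "line", "text"])

def split_in_blocks(lines, n):
    """Split lines into n blocks of near-equal size, preserving order."""
    block_size, remainder = divmod(len(lines), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + block_size + (1 if i < remainder else 0)
        blocks.append(lines[start:end])
        start = end
    return blocks

def check_block(block):
    # Stand-in for feeding a block's texts to one hunspell instance;
    # because each input_line carries its filename and line number,
    # any error can be traced back to the original file:line.
    return [(il.filename, il.line, il.text) for il in block]

lines = [input_line("a.po", i, f"text {i}") for i in range(5)]
lines += [input_line("b.po", i, f"texto {i}") for i in range(4)]
with Pool(3) as pool:
    results = pool.map(check_block, split_in_blocks(lines, 3))
```

Note how lines from both files may share a block, yet no information is
lost, because every input_line is self-contained.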
The multiprocessing pool is sized via a new -j/--jobs command line
option, which defaults to os.cpu_count() to run at maximum speed by
default.
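The new option can be sketched with argparse; the parser setup here is a
hypothetical reconstruction, but the flag names and default match the
description above:

```python
import argparse
import os

parser = argparse.ArgumentParser(prog="pospell")
parser.add_argument(
    "-j", "--jobs",
    type=int,
    default=os.cpu_count(),  # run at maximum speed by default
    help="number of parallel jobs (pool size)",
)
```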
These are the kind of differences I see with python-docs-es in my
machine, so YMMV depending on your setup/project:
$> time pospell -p dict2.txt -l es_ES */*.po -j 1
real 2m1.859s
user 2m6.680s
sys 0m3.829s
$> time pospell -p dict2.txt -l es_ES */*.po -j 2
real 1m10.322s
user 2m18.210s
sys 0m3.559s
Finally, these changes had some minor effects on the testing tooling.
Pylint complained about check_spell now taking too many arguments, so
pylint's max-args setting has been adjusted accordingly. Separately,
coverage information now needs to be collected for sub-processes of the
main test process; this is done automatically by the pytest-cov plug-in,
so I've switched tox to use it rather than the more manual running of
pytest under coverage (which would otherwise require extra setup to
account for subprocesses).
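The resulting tox setup looks roughly like this; treat it as a sketch of
the switch to pytest-cov, with the package name assumed, not as the
project's literal configuration:

```ini
[testenv]
deps =
    pytest
    pytest-cov
# pytest-cov transparently collects coverage from subprocesses too,
# replacing the previous "coverage run -m pytest" invocation.
commands = pytest --cov=pospell {posargs}
```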