3553ecd726
One of the main drawbacks of pospell at the moment is that checking is performed serially by a single hunspell process. In small projects this is not noticeable, but in slightly bigger ones the run time becomes significant (e.g., in python-docs-es it takes ~2 minutes to check the whole set of .po files).

The obvious solution to speed things up is to use multiprocessing, parallelising the process at two different places: first, when reading the input .po files and collecting the input strings to feed into hunspell, and secondly when running hunspell itself.

This commit implements this support. It works as follows:

* A new namedtuple called input_line has been added. It contains a filename, a line, and text, and thus it uniquely identifies an input line in a self-contained way.
* When collecting input to feed into hunspell, the po_to_text routine collects input_lines instead of a simple string. This is done with a multiprocessing Pool to run in parallel across all input files.
* The input_lines are split in N blocks, with N being the size of the pool. Note that during this process input_lines from different files might end up in the same block, and input_lines from the same file might end up in different blocks; however, since input_lines are self-contained, we are not losing information.
* N hunspell instances are run over the N blocks of input_lines using the pool (only the text field from the input_lines is fed into hunspell).
* When interpreting errors from hunspell we can match an input_line with its corresponding hunspell output lines, and thus can identify the original file:line that caused the error.

The multiprocessing pool is sized via a new -j/--jobs command line option, which defaults to os.cpu_count() to run at maximum speed by default.
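The steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: InputLine, split_into_blocks, fake_hunspell, check_block and check_all are hypothetical names, the contiguous splitting strategy is an assumption, and a toy word list stands in for a real hunspell process so the example has no external dependencies.

```python
import multiprocessing
import os
from typing import NamedTuple


class InputLine(NamedTuple):
    """Hypothetical stand-in for the commit's input_line namedtuple."""
    filename: str
    line: int
    text: str


def split_into_blocks(lines, jobs):
    """Split lines into at most `jobs` contiguous blocks of near-equal size."""
    size, extra = divmod(len(lines), jobs)
    blocks, start = [], 0
    for i in range(jobs):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when there are fewer lines than jobs
            blocks.append(lines[start:end])
        start = end
    return blocks


# Toy dictionary: an assumption that replaces the real hunspell dictionary.
KNOWN_WORDS = {"hello", "world", "spell", "check"}


def fake_hunspell(texts):
    """Toy checker: return one list of unknown words per input text."""
    return [
        [word for word in text.split() if word.lower() not in KNOWN_WORDS]
        for text in texts
    ]


def check_block(block):
    """Feed only the text field to the checker, then map errors back to file:line."""
    outputs = fake_hunspell([item.text for item in block])
    return [
        (item.filename, item.line, word)
        for item, words in zip(block, outputs)
        for word in words
    ]


def check_all(lines, jobs=None):
    """Check every block in its own worker process and flatten the results."""
    jobs = jobs or os.cpu_count() or 1
    blocks = split_into_blocks(lines, jobs)
    with multiprocessing.Pool(jobs) as pool:
        results = pool.map(check_block, blocks)
    return [error for block_errors in results for error in block_errors]
```

Because each InputLine carries its own filename and line number, a block can mix lines from several files and the errors can still be reported as file:line:word. On platforms that start workers with "spawn", check_all should be called from under an `if __name__ == "__main__":` guard.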
These are the kind of differences I see with python-docs-es on my machine, so YMMV depending on your setup/project:

    $> time pospell -p dict2.txt -l es_ES */*.po -j 1
    real 2m1.859s
    user 2m6.680s
    sys 0m3.829s

    $> time pospell -p dict2.txt -l es_ES */*.po -j 2
    real 1m10.322s
    user 2m18.210s
    sys 0m3.559s

Finally, these changes had some minor effects on the tooling around testing. Pylint complained about there being too many arguments now in check_spell, so pylint's max-args setting has been adjusted as discussed. Separately, coverage information now needs to be collected for sub-processes of the test main process; this is automatically done by the pytest-cov plug-in, so I've switched tox to use that rather than the more manual running of pytest under coverage (which would otherwise require some extra setup to account for subprocesses).
README.md
pospell
pospell is a spellchecker for po files containing reStructuredText.
Pospell is part of poutils!
Poutils (.po utils) is a metapackage to easily install useful Python tools to use with po files, and pospell is a part of it! Go check out Poutils to discover the other tools!
Examples
By giving files to pospell:
$ pospell --language fr about.po
about.po:47:Jr.
about.po:55:reStructuredText
about.po:55:Docutils
about.po:63:Fredrik
about.po:63:Lundh
about.po:75:language
about.po:75:librarie
By using a bash expansion (note that we do not put quotes around *.po to let bash do its expansion):
$ pospell --language fr *.po
…
By using a glob pattern (note that we do put quotes around **/*.po to keep your shell from trying to expand it; we'll let Python do the expansion):
$ pospell --language fr --glob '**/*.po'
…
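To illustrate what "letting Python do the expansion" can look like, here is a minimal sketch using the standard glob module. It is an assumption about the general technique, not pospell's actual implementation; expand_po_files is a hypothetical helper, and the de-duplication and sorting are design choices made for this example only.

```python
import glob


def expand_po_files(po_files, pattern=None):
    """Combine explicit file names with the matches of an optional glob pattern.

    With recursive=True, "**" matches any number of nested directories
    (including none), which the shell would only do with special options.
    """
    matched = glob.glob(pattern, recursive=True) if pattern else []
    # De-duplicate in case a file is both listed explicitly and matched.
    return sorted(set(po_files) | set(matched))
```

Quoting the pattern on the command line matters here: unquoted, the shell would expand it before Python ever sees the `**`.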
Usage
usage: pospell [-h] [-l LANGUAGE] [--glob GLOB] [--debug] [-p PERSONAL_DICT]
               [po_file [po_file ...]]

Check spelling in po files containing restructuredText.

positional arguments:
  po_file               Files to check, can optionally be mixed with --glob,
                        or not, use the one that fit your needs.

optional arguments:
  -h, --help            show this help message and exit
  -l LANGUAGE, --language LANGUAGE
                        Language to check, you'll have to install the
                        corresponding hunspell dictionary, on Debian see apt
                        list 'hunspell-*'.
  --glob GLOB           Provide a glob pattern, to be interpreted by pospell,
                        to find po files, like --glob '**/*.po'.
  --debug
  -p PERSONAL_DICT, --personal-dict PERSONAL_DICT

A personal dict (the -p option) is simply a text file with one word per line.
Contributing
You can work in a venv. To install the project locally:
python -m pip install .
And to test it locally:
python -m pip install tox
tox -p all