pospell is a spellchecker for po files containing reStructuredText.
rtobar 3553ecd726
Refactor pospell to use multiprocessing (#32)
One of the main drawbacks of pospell at the moment is that checking is
performed serially by a single hunspell process. In small projects this
is not noticeable, but in slightly bigger ones the run time adds up
(e.g., in python-docs-es it takes ~2 minutes to check the whole set of
.po files).

The obvious solution to speed things up is to use multiprocessing,
parallelising the process at two different places: first, when reading
the input .po files and collecting the input strings to feed into
hunspell, and secondly when running hunspell itself.

This commit implements this support. It works as follows:

 * A new namedtuple called input_line has been added. It contains a
   filename, a line, and text, and thus it uniquely identifies an input
   line in a self-contained way.
 * When collecting input to feed into hunspell, the po_to_text routine
   collects input_lines instead of a simple string. This is done with a
   multiprocessing Pool to run in parallel across all input files.
 * The input_lines are split in N blocks, with N being the size of the
   pool. Note that during this process input_lines from different files
   might end up in the same block, and input_lines from the same file
   might end up in different blocks; however since input_lines are
   self-contained we are not losing information.
 * N hunspell instances are run over the N blocks of input_lines using
   the pool (only the text field from the input_lines is fed into
   hunspell).
 * When interpreting errors from hunspell we can match an input_line
   with its corresponding hunspell output lines, and thus can identify
   the original file:line that caused the error.
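
The steps above can be sketched roughly as follows. This is an
illustration only, not the code from this commit: the field names of
input_line follow the description above, while split_in_blocks,
check_block and check_all are hypothetical helpers, and check_block is a
plain-Python stand-in for a real hunspell run.

```python
import functools
import itertools
import multiprocessing
from collections import namedtuple

# Field names follow the commit message; the real definition may differ.
input_line = namedtuple("input_line", "filename line text")


def split_in_blocks(lines, n):
    """Split lines into at most n contiguous blocks of near-equal size."""
    size, extra = divmod(len(lines), n)
    blocks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when there are fewer lines than n
            blocks.append(lines[start:end])
        start = end
    return blocks


def check_block(block, known_words):
    """Hypothetical stand-in for one hunspell run over a block.

    Only the text field is "checked"; filename and line are carried along
    so every error can be reported as file:line afterwards.
    """
    errors = []
    for item in block:
        for word in item.text.split():
            if word.lower() not in known_words:
                errors.append((item.filename, item.line, word))
    return errors


def check_all(lines, jobs, known_words):
    """Check the blocks in parallel with a pool of `jobs` workers."""
    blocks = split_in_blocks(lines, jobs)
    worker = functools.partial(check_block, known_words=known_words)
    with multiprocessing.Pool(jobs) as pool:
        per_block = pool.map(worker, blocks)
    return list(itertools.chain.from_iterable(per_block))
```

Because each input_line carries its own filename and line, the blocks
can mix files freely. On platforms that spawn worker processes,
check_all should be called from under an if __name__ == "__main__":
guard.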

The multiprocessing pool is sized via a new -j/--jobs command line
option, which defaults to os.cpu_count() to run at maximum speed by
default.
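
As a minimal sketch of how such an option can be declared with argparse
(the parser name and help text here are assumptions, not copied from
pospell.py):

```python
import argparse
import os


def build_parser():
    """Build a parser with only the -j/--jobs option, for illustration."""
    parser = argparse.ArgumentParser(prog="pospell-sketch")
    parser.add_argument(
        "-j",
        "--jobs",
        type=int,
        default=os.cpu_count(),  # one worker per CPU unless told otherwise
        help="number of worker processes (illustrative help text)",
    )
    return parser
```

With this sketch, build_parser().parse_args(["-j", "2"]).jobs is 2, and
omitting the option falls back to os.cpu_count().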

This is the kind of difference I see with python-docs-es on my
machine; YMMV depending on your setup/project:

$> time pospell -p dict2.txt -l es_ES */*.po -j 1
real    2m1.859s
user    2m6.680s
sys     0m3.829s

$> time pospell -p dict2.txt -l es_ES */*.po -j 2
real    1m10.322s
user    2m18.210s
sys     0m3.559s

Finally, these changes had some minor effects on the tooling around
testing. Pylint complained about there being too many arguments now in
check_spell, so pylint's max-args setting has been adjusted as
discussed. Separately, coverage information now needs to be collected
for sub-processes of the test main process; this is automatically done
by the pytest-cov plug-in, so I've switched tox to use that rather than
the more manual running of pytest under coverage (which would otherwise
require some extra setup to account for subprocesses).
2021-11-26 10:26:35 +01:00
.github Bump requirements. 2021-10-27 19:12:29 +02:00
tests Tox and github actions. (#24) 2020-11-23 14:26:34 +01:00
.gitignore Git ignore file 2018-07-27 14:57:43 +02:00
.pre-commit-hooks.yaml Add pre-commit hook (#14) 2020-05-22 17:48:57 +02:00
.pylintrc Refactor pospell to use multiprocessing (#32) 2021-11-26 10:26:35 +01:00
CHANGELOG.md Bump to v1.0.12. 2021-04-10 00:12:33 +02:00
README.md Bump requirements. 2021-10-27 19:12:29 +02:00
pospell.py Refactor pospell to use multiprocessing (#32) 2021-11-26 10:26:35 +01:00
pyproject.toml Tox and github actions. (#24) 2020-11-23 14:26:34 +01:00
setup.cfg Pleases pylint and mypy. 2021-10-27 17:24:27 +02:00
setup.py Move from setup.py to setup.cfg. 2020-11-23 12:56:58 +01:00
tox.ini Refactor pospell to use multiprocessing (#32) 2021-11-26 10:26:35 +01:00

README.md

pospell

pospell is a spellchecker for po files containing reStructuredText.

Pospell is part of poutils!

Poutils (.po utils) is a metapackage to easily install useful Python tools to use with po files and pospell is a part of it! Go check out Poutils to discover the other tools!

Examples

By giving files to pospell:

$ pospell --language fr about.po
about.po:47:Jr.
about.po:55:reStructuredText
about.po:55:Docutils
about.po:63:Fredrik
about.po:63:Lundh
about.po:75:language
about.po:75:librarie

By using a bash expansion (note that we do not put quotes around *.po to let bash do its expansion):

$ pospell --language fr *.po
…

By using a glob pattern (note that we do put quotes around **/*.po to keep your shell from trying to expand it; we'll let Python do the expansion):

$ pospell --language fr --glob '**/*.po'
…
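
To illustrate what letting Python do the expansion can look like, here is a minimal sketch using pathlib. The find_po_files helper is hypothetical and is not necessarily how pospell expands --glob internally:

```python
from pathlib import Path


def find_po_files(pattern, root="."):
    """Expand a glob pattern below root and return sorted relative paths.

    With pathlib, '**' matches the root directory itself and any nested
    directories, so '**/*.po' finds po files at every depth.
    """
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))
```

In a tree containing about.po and library/os.po, find_po_files("**/*.po") would return both paths, whereas an unquoted *.po expanded by the shell would only match about.po.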

Usage

usage: pospell [-h] [-l LANGUAGE] [--glob GLOB] [--debug] [-p PERSONAL_DICT]
               [po_file [po_file ...]]

Check spelling in po files containing restructuredText.

positional arguments:
  po_file               Files to check, can optionally be mixed with --glob,
                        or not, use the one that fit your needs.

optional arguments:
  -h, --help            show this help message and exit
  -l LANGUAGE, --language LANGUAGE
                        Language to check, you'll have to install the
                        corresponding hunspell dictionary, on Debian see apt
                        list 'hunspell-*'.
  --glob GLOB           Provide a glob pattern, to be interpreted by pospell,
                        to find po files, like --glob '**/*.po'.
  --debug
  -p PERSONAL_DICT, --personal-dict PERSONAL_DICT

A personal dict (the -p option) is simply a text file with one word per line.
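
Since the format is that simple, such a file is easy to generate. A minimal sketch, assuming a hypothetical write_personal_dict helper and a file name of your choice:

```python
from pathlib import Path


def write_personal_dict(path, words):
    """Write the given words to path, one per line, sorted and deduplicated."""
    unique = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("".join(w + "\n" for w in unique), encoding="utf-8")
    return unique
```

The resulting file can then be passed with -p, for example pospell -p dict.txt --language fr about.po.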

Contributing

You can work in a venv. To install the project locally:

python -m pip install .

And to test it locally:

python -m pip install tox
tox -p all