pospell is a spellchecker for po files containing reStructuredText.


rtobar 3553ecd726 Refactor pospell to use multiprocessing (#32)	2021-11-26 10:26:35 +01:00

One of the main drawbacks of pospell at the moment is that checking is performed serially by a single hunspell process. In small projects this is not noticeable, but in slightly bigger ones this can go up a bit (e.g., in python-docs-es it takes ~2 minutes to check the whole set of .po files).

The obvious solution to speed things up is to use multiprocessing, parallelising the process at two different places: first, when reading the input .po files and collecting the input strings to feed into hunspell, and secondly when running hunspell itself.

This commit implements this support. It works as follows:

* A new namedtuple called input_line has been added. It contains a filename, a line, and text, and thus it uniquely identifies an input line in a self-contained way.
* When collecting input to feed into hunspell, the po_to_text routine collects input_lines instead of a simple string. This is done with a multiprocessing Pool to run in parallel across all input files.
* The input_lines are split in N blocks, with N being the size of the pool. Note that during this process input_lines from different files might end up in the same block, and input_lines from the same file might end up in different blocks; however since input_lines are self-contained we are not losing information.
* N hunspell instances are run over the N blocks of input_lines using the pool (only the text field from the input_lines is fed into hunspell).
* When interpreting errors from hunspell we can match an input_line with its corresponding hunspell output lines, and thus can identify the original file:line that caused the error.

The multiprocessing pool is sized via a new -j/--jobs command line option, which defaults to os.cpu_count() to run at maximum speed by default.

These are the kind of differences I see with python-docs-es in my machine, so YMMV depending on your setup/project:

$> time pospell -p dict2.txt -l es_ES /.po -j 1
real 2m1.859s
user 2m6.680s
sys 0m3.829s

$> time pospell -p dict2.txt -l es_ES /.po -j 2
real 1m10.322s
user 2m18.210s
sys 0m3.559s

Finally, these changes had some minor effects on the tooling around testing. Pylint complained about there being too many arguments now in check_spell, so pylint's max-args settings has been adjusted as discussed. Separately, coverage information now needs to be collected for sub-processes of the test main process; this is automatically done by the pytest-cov plug-in, so I've switched tox to use that rather than the more manual running of pytest under coverage (which would otherwise require some extra setup to account for subprocesses).

.github	Bump requirements.	2021-10-27 19:12:29 +02:00
tests	Tox and github actions. (#24)	2020-11-23 14:26:34 +01:00
.gitignore	Git ignore file	2018-07-27 14:57:43 +02:00
.pre-commit-hooks.yaml	Add pre-commit hook (#14)	2020-05-22 17:48:57 +02:00
.pylintrc	Refactor pospell to use multiprocessing (#32)	2021-11-26 10:26:35 +01:00
CHANGELOG.md	Bump to v1.0.12.	2021-04-10 00:12:33 +02:00
README.md	Bump requirements.	2021-10-27 19:12:29 +02:00
pospell.py	Refactor pospell to use multiprocessing (#32)	2021-11-26 10:26:35 +01:00
pyproject.toml	Tox and github actions. (#24)	2020-11-23 14:26:34 +01:00
setup.cfg	Pleases pylint and mypy.	2021-10-27 17:24:27 +02:00
setup.py	Move from setup.py to setup.cfg.	2020-11-23 12:56:58 +01:00
tox.ini	Refactor pospell to use multiprocessing (#32)	2021-11-26 10:26:35 +01:00
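
The block-splitting scheme described in the latest commit message (#32) can be illustrated with a minimal sketch. It is not the code from pospell.py: only the input_line name and its three fields come from the commit message. The helpers split_in_blocks, check_block and check_in_parallel, the KNOWN_WORDS set, and the choice of contiguous near-equal blocks are assumptions made for illustration, and check_block is a stand-in for a real hunspell run.

```python
import multiprocessing
import os
from typing import List, NamedTuple, Optional, Tuple


class input_line(NamedTuple):
    """Self-contained record: where a piece of text came from, plus the text."""
    filename: str
    line: int
    text: str


# Hypothetical dictionary used by the stand-in checker below.
KNOWN_WORDS = {"hello", "world"}


def split_in_blocks(lines: List[input_line], jobs: int) -> List[List[input_line]]:
    """Split lines into at most `jobs` contiguous blocks of near-equal size."""
    jobs = max(1, jobs)
    size, extra = divmod(len(lines), jobs)
    blocks, start = [], 0
    for i in range(jobs):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty blocks when there are fewer lines than jobs
            blocks.append(lines[start:end])
        start = end
    return blocks


def check_block(block: List[input_line]) -> List[Tuple[str, int, str]]:
    """Stand-in for one hunspell instance (assumption): flag unknown words.

    Each error keeps the filename and line of the input_line it came from,
    so blocks can mix lines from several files without losing information.
    """
    return [
        (item.filename, item.line, word)
        for item in block
        for word in item.text.split()
        if word.lower() not in KNOWN_WORDS
    ]


def check_in_parallel(lines: List[input_line], jobs: Optional[int] = None):
    """Run check_block over the blocks with a pool of `jobs` processes."""
    jobs = jobs or os.cpu_count() or 1
    blocks = split_in_blocks(lines, jobs)
    with multiprocessing.Pool(jobs) as pool:
        results = pool.map(check_block, blocks)
    return [error for block_errors in results for error in block_errors]
```

Because every error carries its own filename and line, the order in which blocks finish does not matter for reporting. On platforms that spawn worker processes, check_in_parallel should be called from under an `if __name__ == "__main__":` guard.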

README.md

pospell

pospell is a spellchecker for po files containing reStructuredText.

Pospell is part of poutils!

Poutils (.po utils) is a metapackage to easily install useful Python tools to use with po files and pospell is a part of it! Go check out Poutils to discover the other tools!

Examples

By giving files to pospell:

$ pospell --language fr about.po
about.po:47:Jr.
about.po:55:reStructuredText
about.po:55:Docutils
about.po:63:Fredrik
about.po:63:Lundh
about.po:75:language
about.po:75:librarie
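
Each output line above has the shape file:line:word, which makes it easy to post-process. The following is a small sketch of one way to do that; parse_error and words_by_file are hypothetical helper names, not part of pospell, and the parsing assumes that filenames contain no colon.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parse_error(line: str) -> Tuple[str, int, str]:
    """Split one 'file:line:word' output line (assumes no ':' in the filename)."""
    filename, lineno, word = line.rstrip("\n").split(":", 2)
    return filename, int(lineno), word


def words_by_file(output: str) -> Dict[str, List[str]]:
    """Group the reported words by the file they were found in."""
    grouped = defaultdict(list)
    for line in output.splitlines():
        if line.strip():  # ignore blank lines
            filename, _, word = parse_error(line)
            grouped[filename].append(word)
    return dict(grouped)
```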

By using a bash expansion (note that we do not put quotes around *.po to let bash do its expansion):

$ pospell --language fr *.po
…

By using a glob pattern (note that we do put quotes around **/*.po to keep your shell from trying to expand it; we'll let Python do the expansion):

$ pospell --language fr --glob '**/*.po'
…
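
To see what letting Python do the expansion can look like, here is a minimal sketch using the standard library's pathlib. It is not taken from pospell's source; the helper name expand_glob and its root parameter are assumptions for illustration.

```python
from pathlib import Path
from typing import List


def expand_glob(pattern: str, root: str = ".") -> List[str]:
    """Return the sorted files under root matching pattern (hypothetical helper)."""
    base = Path(root)
    return sorted(
        path.relative_to(base).as_posix()
        for path in base.glob(pattern)  # '**' matches any depth, including none
        if path.is_file()
    )
```

With a pattern such as '**/*.po', this picks up po files in the root directory as well as in nested directories, which a plain *.po shell expansion would not do.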

Usage

usage: pospell [-h] [-l LANGUAGE] [--glob GLOB] [--debug] [-p PERSONAL_DICT]
               [po_file [po_file ...]]

Check spelling in po files containing restructuredText.

positional arguments:
  po_file               Files to check, can optionally be mixed with --glob,
                        or not, use the one that fit your needs.

optional arguments:
  -h, --help            show this help message and exit
  -l LANGUAGE, --language LANGUAGE
                        Language to check, you'll have to install the
                        corresponding hunspell dictionary, on Debian see apt
                        list 'hunspell-*'.
  --glob GLOB           Provide a glob pattern, to be interpreted by pospell,
                        to find po files, like --glob '**/*.po'.
  --debug
  -p PERSONAL_DICT, --personal-dict PERSONAL_DICT

A personal dict (the -p option) is simply a text file with one word per line.
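
Since the format is just one word per line, such a file is easy to generate. The sketch below writes one from a list of words; write_personal_dict is a hypothetical helper, and sorting and de-duplicating the words is a convenience chosen here, not something the format requires.

```python
from pathlib import Path
from typing import Iterable


def write_personal_dict(words: Iterable[str], path: str) -> None:
    """Write the words to path, one per line, sorted and without duplicates."""
    unique = sorted({word.strip() for word in words if word.strip()})
    Path(path).write_text("".join(f"{word}\n" for word in unique), encoding="utf-8")
```

The resulting file can then be passed to pospell with the -p option.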

Contributing

You can work in a venv. To install the project locally:

python -m pip install .

And to test it locally:

python -m pip install tox
tox -p all