pospell/pospell.py

"""pospell is a spellcheckers for po files containing reStructuedText."""
import collections
import functools
import io
import logging
import multiprocessing
import os
import subprocess
import sys
from contextlib import redirect_stderr
from itertools import chain
from pathlib import Path
from shutil import which
from string import digits
from typing import List, Tuple
from unicodedata import category
import docutils.frontend
import docutils.nodes
import docutils.parsers.rst
import polib
import regex
from docutils.parsers.rst import roles
from docutils.utils import new_document
from sphinxlint import rst
__version__ = "1.3"
DEFAULT_DROP_CAPITALIZED = {"fr": True, "fr_FR": True}
Error = Tuple[str, int, str]
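
# A self-contained (filename, line, text) record; blocks of these can be
# split across worker processes without losing track of where each line
# came from.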
input_line = collections.namedtuple("input_line", "filename line text")
class POSpellException(Exception):
"""All exceptions from this module inherit from this one."""
class Unreachable(POSpellException):
"""The code encontered a state that should be unreachable."""
try:
HUNSPELL_VERSION = subprocess.check_output(
["hunspell", "--version"], universal_newlines=True
).split("\n", maxsplit=1)[0]
except FileNotFoundError:
print("hunspell not found, please install hunspell.", file=sys.stderr)
sys.exit(1)
class DummyNodeClass(docutils.nodes.Inline, docutils.nodes.TextElement):
"""Used to represent any unknown roles, so we can parse any rst blindly."""
def monkey_patch_role(role):
"""Patch docutils.parsers.rst.roles.role so it always match.
Giving a DummyNodeClass for unknown roles.
"""
def role_or_generic(role_name, language_module, lineno, reporter):
base_role, message = role(role_name, language_module, lineno, reporter)
if base_role is None:
roles.register_generic_role(role_name, DummyNodeClass)
base_role, message = role(role_name, language_module, lineno, reporter)
return base_role, message
return role_or_generic
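
# Patch docutils' role lookup at import time so that parsing never fails on
# roles pospell does not know about.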
roles.role = monkey_patch_role(roles.role)
class NodeToTextVisitor(docutils.nodes.NodeVisitor):
"""Recursively convert a docutils node to a Python string.

    Usage:

    >>> visitor = NodeToTextVisitor(document)
    >>> document.walk(visitor)
    >>> print(str(visitor))

    It ignores some nodes (see IGNORE_LIST) that we don't want to feed
    to hunspell (emphasis typically contains proper names that are
    unknown to dictionaries).
"""
IGNORE_LIST = (
"emphasis",
"superscript",
"title_reference",
"substitution_reference",
"citation_reference",
"strong",
"DummyNodeClass",
"reference",
"literal",
"Text",
"system_message",
)
def __init__(self, document):
"""Initialize visitor for the given node/document."""
self.output = []
super().__init__(document)
def unknown_visit(self, node):
"""Mandatory implementation to visit unknwon nodes."""
@staticmethod
def ignore(node):
"""Just raise SkipChildren.
Used for all visit_* in the IGNORE_LIST.
See __getattr__.
"""
raise docutils.nodes.SkipChildren
def __getattr__(self, name):
"""Skip childrens from the IGNORE_LIST."""
if name.startswith("visit_") and name[6:] in self.IGNORE_LIST:
return self.ignore
raise AttributeError(name)
def visit_Text(self, node):
"""Keep this node text, this is typically what we want to spell check."""
self.output.append(docutils.nodes.unescape(node, restore_backslashes=True))
def __str__(self):
"""Give the accumulated strings."""
return " ".join(self.output)
def strip_rst(line):
"""Transform reStructuredText to plain text."""
if line.endswith("::"):
        # Drop the trailing ::, it would cause a "Literal block expected" warning
line = line[:-2]
line = rst.NORMAL_ROLE_RE.sub("", line)
settings = docutils.frontend.get_default_settings()
settings.pep_references = None
settings.rfc_references = None
settings.pep_base_url = "http://www.python.org/dev/peps/"
settings.pep_file_url_template = "pep-%04d"
parser = docutils.parsers.rst.Parser()
stderr_stringio = io.StringIO()
with redirect_stderr(stderr_stringio):
document = new_document("<rst-doc>", settings=settings)
parser.parse(line, document)
stderr = stderr_stringio.getvalue()
if stderr:
print(stderr.strip(), "while parsing:", line)
visitor = NodeToTextVisitor(document)
document.walk(visitor)
return str(visitor)
def clear(line, drop_capitalized=False, po_path=""):
"""Clear various other syntaxes we may encounter in a line."""
# Normalize spaces
line = regex.sub(r"\s+", " ", line).replace("\xad", "")
to_drop = {
r'<a href="[^"]*?">',
r"{[a-z_]*?}", # Sphinx variable
r"%\([a-z_]+?\)[diouxXeEfFgGcrsa%]", # Sphinx variable
r"« . »", # Single letter examples (typically in Unicode documentation)
}
if drop_capitalized:
to_drop.add(
# Strip capitalized words in sentences
r"(?<!\. |^|-)\b(\p{Letter}['])?\b\p{Uppercase}\p{Letter}[\w.-]*\b"
)
if logging.getLogger().isEnabledFor(logging.DEBUG):
for pattern in to_drop:
for dropped in regex.findall(pattern, line):
logging.debug(
"%s: dropping %r via %r due to from %r",
po_path,
dropped,
pattern,
line,
)
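    # All patterns are applied in one pass as a single alternation; every
    # match is replaced by a space.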
return regex.sub("|".join(to_drop), r" ", line)
def quote_for_hunspell(text):
"""Quote a paragraph so hunspell don't misinterpret it.
Quoting the manpage:
It is recommended that programmatic interfaces prefix
every data line with an uparrow to protect themselves
2020-11-23 13:26:34 +00:00
against future changes in hunspell.
"""
out = []
for line in text:
out.append("^" + line if line else "")
return "\n".join(out)
def po_to_text(po_path, drop_capitalized=False):
"""Convert a po file to a text file.
This strips the msgids and all po syntax while keeping lines at
their same position / line number.
"""
input_lines = []
lines = 0
try:
entries = polib.pofile(Path(po_path).read_text(encoding="UTF-8"))
except Exception as err:
raise POSpellException(str(err)) from err
for entry in entries:
if entry.msgid == entry.msgstr:
continue
if entry.obsolete:
continue
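        # Pad with empty input_lines so reported line numbers keep matching
        # the original .po file.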
while lines < entry.linenum:
lines += 1
input_lines.append(input_line(po_path, lines, ""))
lines += 1
input_lines.append(
input_line(
po_path,
lines,
clear(strip_rst(entry.msgstr), drop_capitalized, po_path=po_path),
)
)
return input_lines
def parse_args():
"""Parse command line arguments."""
import argparse
parser = argparse.ArgumentParser(
description="Check spelling in po files containing restructuredText."
)
parser.add_argument(
"-l",
"--language",
type=str,
default="fr",
help="Language to check, you'll have to install the corresponding "
"hunspell dictionary, on Debian see apt list 'hunspell-*' (defaults to 'fr').",
)
parser.add_argument(
"--glob",
type=str,
help="Provide a glob pattern, to be interpreted by pospell, to find po files, "
"like --glob '**/*.po'.",
)
parser.add_argument(
"--drop-capitalized",
action="store_true",
help="Always drop capitalized words in sentences"
" (defaults according to the language).",
)
parser.add_argument(
"--no-drop-capitalized",
action="store_true",
help="Never drop capitalized words in sentences"
" (defaults according to the language).",
)
parser.add_argument(
"po_file",
nargs="*",
type=Path,
help="Files to check, can optionally be mixed with --glob, or not, "
"use the one that fit your needs.",
)
parser.add_argument(
"-v",
"--verbose",
action="count",
default=0,
help="More output, use -vv, -vvv, and so on.",
)
parser.add_argument(
"--version",
action="version",
version="%(prog)s " + __version__ + " using hunspell: " + HUNSPELL_VERSION,
)
parser.add_argument("--debug", action="store_true")
parser.add_argument("-p", "--personal-dict", type=Path)
parser.add_argument(
"--modified", "-m", action="store_true", help="Use git to find modified files."
)
parser.add_argument(
"-j",
"--jobs",
type=int,
default=os.cpu_count(),
help="Number of files to check in paralel, defaults to all available CPUs",
)
args = parser.parse_args()
if args.personal_dict is not None and not args.personal_dict.exists():
print(f"Error: dictionary {str(args.personal_dict)!r} not found.")
sys.exit(1)
if args.drop_capitalized and args.no_drop_capitalized:
print("Error: don't provide both --drop-capitalized AND --no-drop-capitalized.")
parser.print_help()
sys.exit(1)
if not args.po_file and not args.modified and not args.glob:
parser.print_help()
sys.exit(1)
return args
def look_like_a_word(word):
"""Return True if the given str looks like a word.
Used to filter out non-words like `---` or `-0700` so they don't
get reported. They typically are not errors.
"""
if not word:
return False
if any(digit in word for digit in digits):
return False
if len([c for c in word if category(c) == "Lu"]) > 1:
        return False  # Probably an acronym, or a name like CPython, macOS, SQLite, ...
if "-" in word:
return False
return True
def run_hunspell(language, personal_dict, input_lines) -> List[Error]:
"""Run hunspell over the given input lines."""
personal_dict_arg = ["-p", personal_dict] if personal_dict else []
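    # "-a" runs hunspell in its ispell-compatible pipe mode: it reads lines
    # from stdin and reports spelling results on stdout.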
try:
output = subprocess.check_output(
["hunspell", "-d", language, "-a"] + personal_dict_arg,
universal_newlines=True,
input=quote_for_hunspell(text for _, _, text in input_lines),
)
except subprocess.CalledProcessError:
return []
return parse_hunspell_output(input_lines, output.splitlines())
def flatten(list_of_lists):
"""[[a,b,c], [d,e,f]] -> [a,b,c,d,e,f]."""
return [element for a_list in list_of_lists for element in a_list]
def spell_check(
po_files,
personal_dict=None,
language="en_US",
drop_capitalized=False,
debug_only=False,
jobs=os.cpu_count(),
):
"""Check for spelling mistakes in the given po_files.
(po format, containing restructuredtext), for the given language.
personal_dict allow to pass a personal dict (-p) option, to hunspell.
Debug only will show what's passed to Hunspell instead of passing it.
2018-07-28 22:58:20 +00:00
"""
    # Pool.__exit__ calls terminate() instead of close(); we need the latter,
    # which ensures the processes' atexit handlers execute fully, which in
    # turn lets coverage write the sub-processes' coverage information.
pool = multiprocessing.Pool(jobs) # pylint: disable=consider-using-with
try:
input_lines = flatten(
pool.map(
functools.partial(po_to_text, drop_capitalized=drop_capitalized),
po_files,
)
)
if debug_only:
for filename, line, text in input_lines:
print(filename, line, text, sep=":")
return 0
if not input_lines:
return 0
# Distribute input lines across workers
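        # Ceiling division: each worker gets at most lines_per_job lines.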
lines_per_job = (len(input_lines) + jobs - 1) // jobs
chunked_inputs = [
input_lines[i : i + lines_per_job]
for i in range(0, len(input_lines), lines_per_job)
]
errors = flatten(
pool.map(
functools.partial(run_hunspell, language, personal_dict),
chunked_inputs,
)
)
finally:
pool.close()
pool.join()
for error in errors:
print(*error, sep=":")
return len(errors)
def parse_hunspell_output(inputs, outputs) -> List[Error]:
"""Parse `hunspell -a` output and collect all errors."""
# skip first line of hunspell output (it's the banner)
outputs = iter(outputs[1:])
errors = []
for po_input_line, output_line in zip(inputs, outputs):
if not po_input_line.text:
continue
while output_line:
if output_line.startswith("&"):
_, original, *_ = output_line.split()
if look_like_a_word(original):
errors.append(
(po_input_line.filename, po_input_line.line, original)
)
try:
output_line = next(outputs)
except StopIteration:
break
return errors


def gracefull_handling_of_missing_dicts(language):
"""Check if hunspell dictionary for given language is installed."""
hunspell_dash_d = subprocess.check_output(
["hunspell", "-D"], universal_newlines=True, stderr=subprocess.STDOUT
)
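    # `hunspell -D` typically prints its dictionary search path and the
    # available dictionaries (one full path per line, e.g.
    # /usr/share/hunspell/es_ES) on stderr, hence the stderr redirection
    # above; header lines only add harmless noise to the set built below.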
    languages = {Path(line).name for line in hunspell_dash_d.splitlines()}
def error(*args, file=sys.stderr, **kwargs):
print(*args, file=file, **kwargs)
if language in languages:
return
error(
"The hunspell dictionary for your language is missing, please install it.",
end="\n\n",
)
if which("apt"):
error("Maybe try something like:")
error(f" sudo apt install hunspell-{language}")
else:
error(
f"""I don't know your environment, but I bet the package name looks like:
hunspell-{language}
If you find it, please tell me (by opening an issue or a PR on
https://github.com/JulienPalard/pospell/) so I can enhance this error message.
"""
)
sys.exit(1)


def main():
"""Entry point (for command-line)."""
args = parse_args()
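    # Map the -v count to logging levels: no -v -> CRITICAL (50),
    # -v -> ERROR (40), -vv -> WARNING (30), -vvv -> INFO (20),
    # -vvvv -> DEBUG (10).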
logging.basicConfig(level=50 - 10 * args.verbose)
default_drop_capitalized = DEFAULT_DROP_CAPITALIZED.get(args.language, False)
if args.drop_capitalized:
drop_capitalized = True
elif args.no_drop_capitalized:
drop_capitalized = False
else:
drop_capitalized = default_drop_capitalized
args.po_file = list(
chain(Path(".").glob(args.glob) if args.glob else [], args.po_file)
)
if args.modified:
git_status = subprocess.check_output(
["git", "status", "--porcelain", "--no-renames"], encoding="utf-8"
)
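        # Each `--porcelain` line is "<XY> <path>"; keep every tracked or
        # untracked .po file that is not deleted ("D").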
git_status_lines = [
line.split(maxsplit=2) for line in git_status.split("\n") if line
]
args.po_file.extend(
Path(filename)
for status, filename in git_status_lines
if filename.endswith(".po") and status != "D"
)
try:
errors = spell_check(
args.po_file,
args.personal_dict,
args.language,
drop_capitalized,
args.debug,
args.jobs,
)
except POSpellException as err:
print(err, file=sys.stderr)
sys.exit(-1)
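    # spell_check() uses -1 as a sentinel for "hunspell could not check
    # anything"; the handler below assumes a missing dictionary and explains
    # how to install it.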
if errors == -1:
gracefull_handling_of_missing_dicts(args.language)
sys.exit(0 if errors == 0 else -1)


if __name__ == "__main__":
main()