Initial commit.

Julien Palard 2024-03-28 08:56:51 +01:00
commit 6f7e52a42b
Signed by: mdk
GPG Key ID: 0EFC1AC1006886F8
6 changed files with 646 additions and 0 deletions

3
.gitignore vendored Normal file

@ -0,0 +1,3 @@
__pycache__/
.venv/
.envrc

21
LICENSE Normal file

@ -0,0 +1,21 @@
The MIT License (MIT)
Copyright (c) 2024 Julien Palard
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

128
README.md Normal file

@ -0,0 +1,128 @@
# BoursoBank statement parser

⚠ This library was developed independently of BoursoBank.

## Installation

    pip install boursobank

## Security

### Password

This library **does not connect to the internet** (if in doubt, read
the code): it only reads statements that have already been downloaded
as PDF files, and all processing is done locally.

If in doubt, it should be possible to run the application inside
[firejail](https://github.com/netblue30/firejail) or a similar sandbox.

So there is no need to worry about your password: it is never asked
for (and here there is no need to reread the code: if the library does
not ask for the password, it does not have it).

### Parser errors

Reading PDFs [is not simple](https://pypdf.readthedocs.io/en/stable/user/extract-text.html#ocr-vs-text-extraction).

To make sure no error slips into your analyses, this library provides
a `validate()` method which checks that the initial balance plus all
the lines adds up to the final balance (for a card statement, that the
lines add up to the total debit); otherwise a `ValueError` is raised.

This example will therefore only raise an exception on a parsing error
(or a bank error, as in Monopoly):

```python
for file in args.files:
    statement = Statement.from_pdf(file)
    statement.pretty_print()
    statement.validate()
```
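
The check performed on an account statement can be pictured with a small standalone sketch. It does not import this library: `check_balance` and its arguments are hypothetical names chosen for illustration, and the amounts are `Decimal` values, as in the parser.

```python
from decimal import Decimal


def check_balance(balance_before, values, balance_after):
    """Raise ValueError unless opening balance + all lines == closing balance."""
    computed = balance_before + sum(values, Decimal(0))
    if computed != balance_after:
        raise ValueError(
            f"Inconsistent total, found: {computed!r}, expected: {balance_after!r}."
        )


# Signed amounts: positive for credits, negative for debits.
values = [Decimal("42.42"), Decimal("99.00"), Decimal("-123.45")]

# 1000.00 + 42.42 + 99.00 - 123.45 == 1017.97, so no exception is raised.
check_balance(Decimal("1000.00"), values, Decimal("1017.97"))
```

Using `Decimal` rather than `float` matters here: the comparison is an exact equality, which binary floating point would not reliably satisfy for amounts such as `0.10`.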
## Command-line interface

This library can be used from the command line:

    boursobank *.pdf

will display your statements (card or account), for example:

    $ boursobank 2024-01.pdf
                    2024-01.pdf
    ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
    ┃ Date       ┃ RIB                        ┃
    ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
    │ 2024-01-01 │ 12345 12345 00000000000 99 │
    └────────────┴────────────────────────────┘
    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
    ┃                                    Label ┃ Value    ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
    │                            VIR SEPA Truc │ 42.42    │
    │                     VIR SEPA Machin truc │ 99.00    │
    │    Relevé différé Carte 4810********0000 │ -123.45  │
    └──────────────────────────────────────────┴──────────┘

## API

The whole point is to be able to browse your statements from Python,
for example to export them as CSV:

```python
import argparse
import csv
import sys
from pathlib import Path

from boursobank import Statement


def main():
    args = parse_args()
    statement = Statement.from_pdf(args.ifile)
    writer = csv.writer(sys.stdout)
    for line in statement.lines:
        writer.writerow((line.label, line.value))


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("ifile", type=Path, help="PDF file")
    return parser.parse_args()


if __name__ == "__main__":
    main()
```

The library exposes a single entry point: the `Statement` class.

From this class you can parse PDFs:

    relevé_bancaire = Statement.from_pdf("test.pdf")

or text:

    relevé_bancaire = Statement.from_string("blah blah")

This class mainly provides two attributes. The first is a `headers` dictionary containing:

- `date`: the first day of the month covered by the statement.
- `emit_date`: the date on which the statement was issued.
- `RIB`: the RIB/IBAN of the statement.
- `devise`: the currency, probably `"EUR"`.
- `card_number`: the bank card number, if this is a card statement.
- `card_owner`: the name of the card holder, if this is a card statement.

The second is a `lines` attribute containing instances of the `Line` class, whose
main attributes are:

- `label`: the short description of the line.
- `description`: the rest of the description, when it spans several lines.
- `value`: the amount of the line (positive for a credit, negative for a debit).
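
As an illustration of the sign convention on `value`, here is a minimal sketch that splits lines into credits and debits. `FakeLine` and `totals` are hypothetical names, used only so the example runs without a PDF; any objects exposing `label` and `value`, such as the items of `statement.lines`, should work the same way.

```python
from collections import namedtuple
from decimal import Decimal

# Stand-in for the library's line objects: only `label` and `value` are used.
FakeLine = namedtuple("FakeLine", ["label", "value"])


def totals(lines):
    """Return (total_credits, total_debits) from the sign of each value."""
    total_credits = sum((l.value for l in lines if l.value > 0), Decimal(0))
    total_debits = sum((l.value for l in lines if l.value < 0), Decimal(0))
    return total_credits, total_debits


lines = [
    FakeLine("VIR SEPA Truc", Decimal("42.42")),
    FakeLine("VIR SEPA Machin truc", Decimal("99.00")),
    FakeLine("Relevé différé Carte 4810********0000", Decimal("-123.45")),
]
total_credits, total_debits = totals(lines)
print(total_credits, total_debits)  # 141.42 -123.45
```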

384
boursobank.py Normal file

@ -0,0 +1,384 @@
"""Parses BoursoBank account statements."""

import datetime as dt
import logging
import re
from decimal import Decimal

from pypdf import PdfReader
from rich.console import Console
from rich.table import Table
from rich import print as rich_print
from rich.panel import Panel

__version__ = "0.1"

DATE_RE = r"([0-9]{1,2}/[0-9]{2}/[0-9]{2,4})"

HEADER_VALUE_PATTERN = rf"""\s*
    (?P<date>{DATE_RE})\s+
    (?P<RIB>[0-9]{{5}}\s+[0-9]{{5}}\s+[0-9]{{11}}\s+[0-9]{{2}})\s+
    (
        (?P<devise>[A-Z]{{3}})
        |
        (?P<card_number>[0-9]{{4}}\*{{8}}[0-9]{{4}})
    )\s+
    (?P<periode>(du)?\s+{DATE_RE}\s+(au\s+)?{DATE_RE})\s+
"""

RE_CARD_OWNER = [  # First pattern is tried first
    re.compile(r"Porteur\s+de\s+la\s+carte\s+:\s+(?P<porteur>.*)$", flags=re.M),
    re.compile(
        r"44\s+rue\s+Traversiere\s+CS\s+80134\s+92772\s+"
        r"Boulogne-Billancourt\s+Cedex\s+(?P<porteur>.*)$",
        flags=re.M,
    ),
]

logger = logging.getLogger(__name__)


def parse_decimal(value: str):
    """Parse a French value like 1.234,56 to a Decimal instance."""
    return Decimal(value.replace(".", "").replace(",", "."))


class Line:
    """Represents one line (debit or credit) in a bank statement."""

    PATTERN = re.compile(
        rf"\s+(?P<date>{DATE_RE})\s*(?P<label>.*)\s+"
        rf"(?P<valeur>{DATE_RE})\s+(?P<amount>[0-9.,]+)$"
    )

    def __init__(self, statement, line):
        self.statement = statement
        self.line = line
        self.description = ""
        self.match = self.PATTERN.match(line)

    @property
    def label(self):
        """Line short description."""
        return re.sub(r"\s+", " ", self.match["label"]).strip()

    @property
    def safe_label(self):
        """Line short description without double quotes."""
        return self.label.replace('"', "")

    def add_description(self, description_line):
        """Add a line to a long description."""
        self.description += description_line

    @property
    def direction(self):
        """Return '-' for outbound, and '+' for inbound.

        There are two columns in the PDF: Débit, Crédit.

        Sadly we don't really know where they are, and there are
        variations depending on the format, so we have to use a
        heuristic.
        """
        if self.statement.headers["date"] < dt.date(2021, 1, 1):
            column_at = 98
        else:
            column_at = 225
        column = self.match.start("amount")
        return "-" if column < column_at else "+"

    @property
    def amount(self):
        """Raw value for this line, regardless of its 'debit'/'credit' column."""
        return parse_decimal(self.match["amount"])

    @property
    def value(self):
        """Value for this line. Positive for credits, negative for debits."""
        return self.amount if self.direction == "+" else -self.amount

    def __str__(self):
        return f"{self.safe_label} {self.value}"


class AccountLine(Line):
    """Represents one line (debit or credit) in a bank statement."""

    PATTERN = re.compile(
        rf"\s+(?P<date>{DATE_RE})\s*(?P<label>.*)\s+"
        rf"(?P<valeur>{DATE_RE})\s+(?P<amount>[0-9.,]+)$"
    )


class BalanceBeforeLine(AccountLine):
    PATTERN = re.compile(rf"\s+SOLDE\s+AU\s+:\s+{DATE_RE}\s+(?P<amount>[0-9,.]+)$")


class BalanceAfterLine(AccountLine):
    PATTERN = re.compile(r"\s+Nouveau\s+solde\s+en\s+EUR\s+:\s+(?P<amount>[0-9,.]+)$")


class CardLine(Line):
    """Represents one line (debit or credit) in a card statement."""

    PATTERN = re.compile(
        rf"\s*(?P<date>{DATE_RE})\s+CARTE\s+(?P<valeur>{DATE_RE})"
        rf"\s+(?P<label>.*)\s+(?P<amount>[0-9.,]+)$"
    )

    @property
    def direction(self):
        """Return '-' for outbound, and '+' for inbound.

        As it's a card, we have only one column: debits.
        """
        return "-"


class CardLineDebit(CardLine):
    PATTERN = re.compile(
        rf"\s+A\s+VOTRE\s+DEBIT\s+LE\s+{DATE_RE}\s+(?P<amount>[0-9.,]+)$"
    )


class CardLineDebitWithFrancs(CardLineDebit):
    PATTERN = re.compile(
        rf"\s+A\s+VOTRE\s+DEBIT\s+LE\s+{DATE_RE}\s+"
        rf"(?P<amount>[0-9.,]+)\s+(?P<debit_francs>[0-9.,]+)$"
    )


class CardLineWithFrancs(CardLine):
    """Represents one line (debit or credit) in a card statement."""

    PATTERN = re.compile(
        rf"\s*(?P<date>{DATE_RE})\s+CARTE\s+(?P<label>.*)\s+"
        rf"(?P<amount>[0-9.,]+)\s+(?P<amount_francs>[0-9.,]+)$"
    )


class Statement:
    """Represents a bank account statement."""

    LineImpl = Line

    def __init__(self, filename, text, headers, **kwargs):
        self.filename = filename
        self.text = text
        self.headers = headers
        self.lines = []
        super().__init__(**kwargs)

    @classmethod
    def from_string(cls, string, filename="-"):
        """Builds a statement from a string, useful for testing purposes."""
        headers = cls._parse_header(string, filename)
        if headers.get("card_number"):
            self = CardStatement(filename=filename, text=string, headers=headers)
        else:
            self = AccountStatement(filename=filename, text=string, headers=headers)
        self._parse()
        return self

    @classmethod
    def from_pdf(cls, filename):
        """Builds a statement from a PDF file."""
        buf = []
        for page in PdfReader(filename).pages:
            try:
                buf.append(
                    page.extract_text(extraction_mode="layout", orientations=[0])
                )
            except AttributeError:
                # Maybe just a blank page
                pass  # logger.exception("while parsing PDF %s", filename)
        return cls.from_string("\n".join(buf), filename)

    @classmethod
    def _parse_header(cls, text: str, filename: str) -> dict:
        headers = {}
        for text_line in text.splitlines():
            if values := re.match(HEADER_VALUE_PATTERN, text_line, re.VERBOSE):
                headers["emit_date"] = dt.datetime.strptime(
                    values["date"], "%d/%m/%Y"
                ).date()
                headers["date"] = (
                    dt.datetime.strptime(values["periode"].split()[-1], "%d/%m/%Y")
                    .date()
                    .replace(day=1)
                )
                headers["RIB"] = re.sub(r"\s+", " ", values["RIB"])
                headers["devise"] = values["devise"]
                headers["card_number"] = values["card_number"]
                break
        else:
            # Reached only when no line matched the header pattern.
            logger.warning("Cannot find header values in %s.", filename)
            return {}
        return headers

    def _parse_lines(self):
        current_line = None
        for text_line in self.text.splitlines():
            line = self.LineImpl(self, text_line)
            if line.match:
                if current_line:
                    self.lines.append(current_line)
                current_line = line
            elif current_line:
                current_line.add_description(text_line)
        if current_line:
            self.lines.append(current_line)

    def __str__(self):
        buf = [f"Date: {self.headers['date']}", f"RIB: {self.headers['RIB']}"]
        for line in self.lines:
            buf.append(str(line))
        return "\n".join(buf)


class AccountStatement(Statement):
    LineImpl = AccountLine

    def __init__(self, **kwargs):
        self.balance_before = Decimal(0)
        self.balance_after = Decimal(0)
        super().__init__(**kwargs)

    def validate(self):
        """Consistency check.

        It just verifies that all the lines sum to the right total.
        """
        computed = sum(line.value for line in self.lines)
        if self.balance_before + computed != self.balance_after:
            raise ValueError(
                f"Inconsistent total, found: {self.balance_before + computed!r}, "
                f"expected: {self.balance_after!r} in {self.filename}."
            )

    def _parse(self):
        self._parse_soldes()
        self._parse_lines()

    def _parse_soldes(self):
        for text in self.text.splitlines():
            line = BalanceBeforeLine(self, text)
            if line.match:
                self.balance_before = line.value
            line = BalanceAfterLine(self, text)
            if line.match:
                self.balance_after = line.value

    def pretty_print(self):
        table = Table(title=str(self.filename))
        table.add_column("Date")
        table.add_column("RIB")
        table.add_row(str(self.headers["date"]), self.headers["RIB"])
        Console().print(table)

        table = Table()
        table.add_column("Label", justify="right", style="cyan", no_wrap=True)
        table.add_column("Value", style="magenta")
        for line in self.lines:
            table.add_row(line.label, str(line.value))
        Console().print(table)


class CardStatement(Statement):
    LineImpl = CardLine

    def __init__(self, **kwargs):
        self.card_debit = Decimal(0)
        super().__init__(**kwargs)

    def validate(self):
        """Consistency check.

        It just verifies that all the lines sum to the right total.
        """
        computed = sum(line.value for line in self.lines)
        if computed != self.card_debit:
            raise ValueError(
                f"Inconsistent total, found: {computed!r}, "
                f"expected: {self.card_debit!r} in {self.filename}."
            )

    def _parse(self):
        self._parse_card_owner()
        self._parse_card_debit()
        self._parse_lines()

    def _parse_card_debit(self):
        for text in self.text.splitlines():
            line = CardLineDebitWithFrancs(self, text)
            if line.match:
                self.card_debit = line.value
                self.LineImpl = CardLineWithFrancs
                return
            line = CardLineDebit(self, text)
            if line.match:
                self.card_debit = line.value
                return

    def _parse_card_owner(self):
        for pattern in RE_CARD_OWNER:
            if match := pattern.search(self.text):
                self.headers["card_owner"] = re.sub(r"\s+", " ", match["porteur"])
                break

    def pretty_print(self):
        table = Table(title=str(self.filename))
        table.add_column("Date")
        table.add_column("RIB")
        table.add_column("Card number")
        table.add_column("Card debit")
        table.add_column("Card owner")
        table.add_row(
            str(self.headers["date"]),
            self.headers["RIB"],
            self.headers["card_number"],
            str(self.card_debit),
            self.headers["card_owner"],
        )
        Console().print(table)

        table = Table()
        table.add_column("Label", justify="right", style="cyan", no_wrap=True)
        table.add_column("Value", style="magenta")
        for line in self.lines:
            table.add_row(line.label, str(line.value))
        Console().print(table)


def main():
    args = parse_args()
    logging.getLogger("pypdf._text_extraction._layout_mode._fixed_width_page").setLevel(
        logging.ERROR
    )
    for file in args.files:
        statement = Statement.from_pdf(file)
        if args.debug:
            rich_print(Panel(statement.text))
        statement.pretty_print()
        statement.validate()


def parse_args():
    import argparse
    from pathlib import Path

    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--debug", action="store_true")
    parser.add_argument("files", nargs="*", type=Path)
    return parser.parse_args()


if __name__ == "__main__":
    main()

22
pyproject.toml Normal file

@ -0,0 +1,22 @@
[build-system]
requires = ["flit_core >=3.2,<4"]
build-backend = "flit_core.buildapi"
[project]
name = "boursobank"
authors = [{name = "Julien Palard", email = "julien@palard.fr"}]
license = {file = "LICENSE"}
classifiers = ["License :: OSI Approved :: MIT License"]
dynamic = ["version", "description"]
dependencies = [
"pypdf",
"rich",
]
[project.scripts]
boursobank = "boursobank:main"
[project.urls]
Home = "https://git.afpy.org/mdk/boursobank"
[tool.black]

88
tests/test_parse.py Normal file

@ -0,0 +1,88 @@
"""Simple non-regression tests for the Statement PDF parser.
It's possible to drop some PDF files in the test directory to run some
tests against them too.
"""

import datetime as dt
from pathlib import Path

import pytest

from boursobank import Statement, CardStatement, AccountStatement


def test_parse_header_2012():
    """Test parsing an old format of headers where the date can have a
    single digit.
    """
    # The header pattern needs at least two whitespace characters between
    # the currency and the period, and some whitespace between both dates.
    statement = Statement.from_string(
        """
        ...
        1/02/2012  12345 12345 00000000000 99  EUR  31/12/2011\
        31/01/2012  1.000,00  0,000000 %  1
        ...
        """
    )
    assert statement.headers["date"] == dt.date(2012, 1, 1)
    assert isinstance(statement, AccountStatement)


def test_parse_header_cb():
    """Test parsing a Bank Card statement header (with a bank card number in it)."""
    statement = Statement.from_string(
        """
        ...
        28/02/2024 12345 12345 00000000000 99 4810********9999 \
        du 30/01/2024 au 27/02/2024 1/2
        ...
        """
    )
    assert statement.headers["date"] == dt.date(2024, 2, 1)
    assert isinstance(statement, CardStatement)


def test_parse_cb_line():
    """Test parsing a CB line which contains the label AFTER the value date."""
    statement = Statement.from_string(
        """
        28/02/2024 12345 12345 00000000000 99 4810********9999 du \
        30/01/2024 au 27/02/2024 1/2
        12/02/2024 CARTE 10/02/24 PHOTOMATON 8,00
        ...
        """
    )
    assert statement.lines
    assert statement.lines[0].value == -8


@pytest.mark.parametrize("pdf", list(Path(__file__).parent.glob("*.pdf")))
def test_cb_consistency_from_files(pdf):
    """Test PDF files in the tests/ directory (place them yourself, none are included)."""
    statement = Statement.from_pdf(pdf)
    if not isinstance(statement, CardStatement):
        return
    found = statement.card_debit
    computed = sum(line.value for line in statement.lines)
    assert (
        found == computed
    ), f"Inconsistent total, found: {found!r}, computed: {computed!r}"


def test_old_owner():
    statement = Statement.from_string(
        """
        EN CAS DE PERTE OU DE VOL
        - Appelez le Centre d'opposition au 09 77 40 10 08
        pour une Carte VISA Classic, au 04 42 60 53 44
        pour une Carte VISA Premier.
        - Faites une déclaration au Commissariat de Police
        - Confirmez par courrier à Boursorama Banque, Service Client
        44 rue Traversiere CS 80134 92772 Boulogne-Billancourt Cedex THE OWNER IS HERE
        28/02/2024 12345 12345 00000000000 99 4810********9999 du \
        30/01/2024 au 27/02/2024 1/2
        """
    )
    assert statement.headers["card_owner"] == "THE OWNER IS HERE"