Utilities

finamt.utils

Heuristic rule-based extraction utilities used as a fallback when the LLM is unavailable or returns incomplete data.

These functions are intentionally simple and conservative — they prefer returning None over returning plausibly wrong values.

class finamt.utils.DataExtractor[source]

Bases: object

Heuristic text extraction for receipts.

All methods are static, so they can be called directly on the class; instantiating it is optional.

static extract_company_name(text: str) str | None[source]

Return the first non-trivial line from the top of the receipt.

Skips blank lines, lines that start with a digit (dates, amounts), and lines containing common boilerplate words.
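The skipping rules above can be sketched as follows. This is a minimal illustration, not the finamt source: the function name first_nontrivial_line and the BOILERPLATE word list are assumptions, since the words the library actually skips are not listed here.

```python
from typing import Optional

# Hypothetical boilerplate words; the list used by finamt is not documented here.
BOILERPLATE = ("rechnung", "quittung", "beleg", "kassenbon", "receipt", "invoice")


def first_nontrivial_line(text: str) -> Optional[str]:
    """Return the first line that is not blank, digit-led, or boilerplate."""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank line
        if line[0].isdigit():
            continue  # probably a date or an amount
        if any(word in line.lower() for word in BOILERPLATE):
            continue  # generic header such as "Rechnung Nr. 5"
        return line
    return None  # prefer None over a doubtful guess
```

Returning None when every line is filtered out matches the conservative stance described at the top of this page.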

static extract_date(text: str) datetime | None[source]

Return the first parseable date found in the text.

Handles DD.MM.YYYY, YYYY-MM-DD, DD/MM/YYYY and German month names. Two-digit years below 50 are interpreted as 20YY; all others as 19YY.
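To make the two-digit-year rule concrete, here is a small sketch that scans text for numeric dates and applies the pivot at 50. The names expand_year and first_date and the regular expression are assumptions made for illustration; German month names are left out to keep the example short.

```python
import re
from datetime import datetime
from typing import Optional

# ISO dates (YYYY-MM-DD) or day-first dates separated by "." or "/".
_DATE = re.compile(
    r"\b(?:(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})"
    r"|(?P<d2>\d{1,2})[./](?P<m2>\d{1,2})[./](?P<y2>\d{4}|\d{2}))\b"
)


def expand_year(year: int) -> int:
    """Map a two-digit year to a full year: < 50 is 20YY, otherwise 19YY."""
    if year >= 100:
        return year  # already a four-digit year
    return 2000 + year if year < 50 else 1900 + year


def first_date(text: str) -> Optional[datetime]:
    """Return the first match that is also a valid calendar date."""
    for match in _DATE.finditer(text):
        keys = ("y", "m", "d") if match.group("y") else ("y2", "m2", "d2")
        year, month, day = (int(match.group(k)) for k in keys)
        try:
            return datetime(expand_year(year), month, day)
        except ValueError:
            continue  # e.g. 99.99.2024: looks like a date but is not one
    return None
```

With these assumptions, "05.03.24" resolves to 5 March 2024 and "31/12/98" to 31 December 1998.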

static extract_amounts(text: str) dict[str, Any][source]

Extract monetary amounts from text.

Strategy:

  1. Scan lines that contain a total-indicating keyword; use the amount on that line as the grand total.

  2. Fall back to the largest amount found in the document.

Returns {"total": Decimal | None, "all": [Decimal, ...]}.
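The two-step strategy can be sketched like this. The keyword list, the amount pattern, and the name find_amounts are assumptions for illustration; the real implementation may recognise other keywords and number formats.

```python
import re
from decimal import Decimal
from typing import Any

# Hypothetical total-indicating keywords (lower-case).
TOTAL_KEYWORDS = ("summe", "gesamt", "total", "zu zahlen")

# Matches 1.234,56 / 12,50 / 12.50 but skips pieces of dates such as 12.03.2024.
_AMOUNT = re.compile(
    r"(?<![\d.,])(?:\d{1,3}(?:\.\d{3})+,\d{2}|\d+[.,]\d{2})(?![.,]?\d)"
)


def to_decimal(token: str) -> Decimal:
    """Convert a matched amount to Decimal, treating ',' as the decimal mark."""
    if "," in token:
        token = token.replace(".", "").replace(",", ".")
    return Decimal(token)


def find_amounts(text: str) -> dict[str, Any]:
    all_amounts = [to_decimal(t) for t in _AMOUNT.findall(text)]
    total = None
    # Step 1: the first keyword line that carries an amount wins.
    for line in text.splitlines():
        if any(keyword in line.lower() for keyword in TOTAL_KEYWORDS):
            on_line = _AMOUNT.findall(line)
            if on_line:
                total = to_decimal(on_line[-1])
                break
    # Step 2: otherwise fall back to the largest amount in the document.
    if total is None and all_amounts:
        total = max(all_amounts)
    return {"total": total, "all": all_amounts}
```

Note that the keyword line takes priority even when a larger figure (for example cash tendered) appears elsewhere, which is why step 1 runs before the maximum is considered.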

static extract_vat_info(text: str) dict[str, Decimal | None][source]

Extract the first VAT percentage + absolute amount found.
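A rough sketch of what "first percentage plus amount" could look like is given below. The dictionary keys "rate" and "amount", the function name first_vat, and the pattern are all assumptions; the keys actually returned by extract_vat_info are not stated in this reference. The pattern also ignores thousands separators.

```python
import re
from decimal import Decimal
from typing import Optional

# A percentage, optionally followed on the same line by an amount such as 3,80.
_VAT = re.compile(r"(\d{1,2}(?:[.,]\d+)?)\s*%[^\d\n]*(\d+[.,]\d{2})?")


def _dec(token: Optional[str]) -> Optional[Decimal]:
    """Convert '3,80' or '3.80' to Decimal; pass None through."""
    return Decimal(token.replace(",", ".")) if token else None


def first_vat(text: str) -> dict[str, Optional[Decimal]]:
    match = _VAT.search(text)
    if match is None:
        # Hypothetical key names, used only in this sketch.
        return {"rate": None, "amount": None}
    return {"rate": _dec(match.group(1)), "amount": _dec(match.group(2))}
```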

static extract_items(text: str) list[dict[str, Any]][source]

Parse individual receipt line items.

Returns a list of dicts with keys matching the LLM extraction schema so both paths feed _build_receipt_data identically.

finamt.utils.clean_json_response(response: str) str[source]

Extract and sanitise a JSON object from an LLM response string.

Handles:

  - Markdown code fences (```json ... ```)
  - Trailing commas in objects and arrays
  - Unquoted keys (only attempted when the extracted candidate is not already valid JSON, to avoid corrupting URLs or colons inside strings)

Returns an empty JSON object {} on total failure so callers can always call json.loads() on the result.
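One way to stage these repairs is sketched below: each more invasive fix is only used when the previous candidate fails to parse. The function name clean_json_sketch and the regular expressions are assumptions, not the finamt implementation, and the trailing-comma pattern is deliberately naive.

```python
import json
import re


def _is_valid(candidate: str) -> bool:
    try:
        json.loads(candidate)
        return True
    except json.JSONDecodeError:
        return False


def clean_json_sketch(response: str) -> str:
    # Drop Markdown fence markers (three backticks, optionally tagged "json").
    text = re.sub(r"`{3}(?:json)?", "", response, flags=re.IGNORECASE)
    # Keep everything from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return "{}"
    candidate = text[start:end + 1]
    # Repair 1: remove commas that directly precede "}" or "]".
    no_trailing = re.sub(r",\s*([}\]])", r"\1", candidate)
    # Repair 2: quote bare identifiers that follow "{" or ",".
    quoted = re.sub(r'([{,]\s*)([A-Za-z_]\w*)(\s*:)', r'\1"\2"\3', no_trailing)
    # Return the least-modified candidate that parses.
    for attempt in (candidate, no_trailing, quoted):
        if _is_valid(attempt):
            return attempt
    return "{}"  # always safe to pass to json.loads()
```

Because the untouched candidate is tried first, a response that is already valid JSON is returned unchanged, so a URL such as "http://a.b" inside a string is never rewritten.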

finamt.utils.parse_decimal(value: Any) Decimal | None[source]

Safely coerce any value to Decimal, returning None on failure.
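A minimal sketch of such a coercion follows. It is an illustration under stated assumptions: the name parse_decimal_sketch is hypothetical, and rejecting booleans and non-finite values (NaN, Infinity) is a design choice of this sketch, not documented behaviour. Comma decimal marks are not handled here.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Optional


def parse_decimal_sketch(value: Any) -> Optional[Decimal]:
    """Coerce value to Decimal, or return None if that is not possible."""
    if value is None or isinstance(value, bool):
        return None  # assumption: True/False are not monetary values
    try:
        # Going through str() avoids binary float artefacts like 0.1 + 0.2.
        result = Decimal(str(value).strip())
    except (InvalidOperation, ValueError):
        return None
    # Assumption: NaN and Infinity are treated as failures.
    return result if result.is_finite() else None
```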

finamt.utils.parse_date(date_str: str) datetime | None[source]

Parse an ISO-format date string (YYYY-MM-DD) to datetime.

Also accepts common European formats as a fallback. Uses explicit format strings rather than %B/%b to avoid locale dependency. Handles English abbreviated months (JUL, AUG …) and German month names/abbreviations (OKT, MRZ, JANUAR, OKTOBER …).
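The locale-independent approach described above can be sketched with numeric format strings plus an explicit month table. The MONTHS table, the accepted layouts, and the name parse_date_sketch are assumptions for illustration and are likely narrower than what the library accepts.

```python
import re
from datetime import datetime
from typing import Optional

# Explicit month lookup instead of %B/%b, so the result never depends on the locale.
MONTHS = {
    "jan": 1, "januar": 1, "feb": 2, "februar": 2,
    "mrz": 3, "mär": 3, "märz": 3, "mar": 3,
    "apr": 4, "april": 4, "mai": 5, "may": 5, "jun": 6, "juni": 6,
    "jul": 7, "juli": 7, "aug": 8, "august": 8,
    "sep": 9, "sept": 9, "september": 9,
    "okt": 10, "oct": 10, "oktober": 10,
    "nov": 11, "november": 11, "dez": 12, "dec": 12, "dezember": 12,
}
# ISO first, then European day-first layouts; these directives are purely numeric.
_NUMERIC_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")
# Day, month name, year, e.g. "3 OKT 2024" or "12. März 2023".
_NAMED = re.compile(r"^(\d{1,2})\.?\s*([A-Za-zäÄ]+)\.?\s*(\d{4})$")


def parse_date_sketch(date_str: str) -> Optional[datetime]:
    s = date_str.strip()
    for fmt in _NUMERIC_FORMATS:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            pass  # try the next layout
    match = _NAMED.match(s)
    if match:
        month = MONTHS.get(match.group(2).lower())
        if month:
            try:
                return datetime(int(match.group(3)), month, int(match.group(1)))
            except ValueError:
                return None  # e.g. 31. Februar
    return None
```

Looking months up in a dictionary means "OKT" and "JUL" parse the same way whether the process runs under a German, English, or C locale.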