Tax Document AI

OCR + LLM scanner for small-business tax filing

Built by Nicholas Falshaw · OCR + LLM for small-business bookkeeping · In production since 2025

The problem

Small businesses drown in receipts, invoices, and PDF statements. Manual categorization against a German chart of accounts (SKR03 / SKR04) takes hours every month. Generic OCR services dump raw text, forcing the accountant to re-type line items anyway.

What I built

A document intake pipeline that accepts PDFs, images, and email attachments, runs layout-aware OCR, uses an LLM to extract structured line items, validates them against business rules, categorizes them to the chart of accounts, and exports an accountant-ready batch as DATEV-compatible CSV or pre-filled booking PDFs.

Architecture

  • Ingestion

    Web upload, email attachment, or folder watcher; MIME detection and virus scan

  • OCR layer

    Tesseract for simple receipts, PaddleOCR for complex multi-column invoices, using German-language models

  • LLM extraction

    Ollama-hosted model with structured-output prompts to emit JSON line items (date, counterparty, VAT rate, net/gross, account hint)

  • Validation

    Deterministic rules for VAT plausibility, duplicate detection, date-range checks

  • Storage

    PostgreSQL with full-text search across extracted documents

  • Export

    DATEV CSV, accountant-ready PDF summary, or direct push to a bookkeeping system
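The hand-off from LLM extraction to validation can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the JSON field names, the set of allowed VAT rates, the rounding tolerance, and the duplicate key are all assumptions chosen to mirror the fields and rules listed above.

```python
import json
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Assumed set of plausible German VAT rates in percent; adjust as needed.
ALLOWED_VAT_RATES = {Decimal("0"), Decimal("7"), Decimal("19")}
TOLERANCE = Decimal("0.02")  # hypothetical rounding tolerance in EUR


@dataclass(frozen=True)
class LineItem:
    date: date
    counterparty: str
    vat_rate: Decimal
    net: Decimal
    gross: Decimal
    account_hint: str


def parse_line_item(raw: str) -> LineItem:
    """Parse one JSON object emitted by the LLM (field names are assumed)."""
    d = json.loads(raw)
    return LineItem(
        date=date.fromisoformat(d["date"]),
        counterparty=d["counterparty"].strip(),
        vat_rate=Decimal(str(d["vat_rate"])),
        net=Decimal(str(d["net"])),
        gross=Decimal(str(d["gross"])),
        account_hint=d.get("account_hint", ""),
    )


def validate(item: LineItem, seen: set, period_start: date, period_end: date) -> list[str]:
    """Return a list of rule violations; an empty list means the item passed."""
    problems = []
    # VAT plausibility: known rate, and gross must match net plus VAT.
    if item.vat_rate not in ALLOWED_VAT_RATES:
        problems.append("unexpected VAT rate")
    expected_gross = item.net * (1 + item.vat_rate / 100)
    if abs(expected_gross - item.gross) > TOLERANCE:
        problems.append("net/gross mismatch")
    # Date-range check against the booking period being prepared.
    if not (period_start <= item.date <= period_end):
        problems.append("date outside booking period")
    # Duplicate detection on a simple (date, counterparty, gross) key.
    key = (item.date, item.counterparty.lower(), item.gross)
    if key in seen:
        problems.append("possible duplicate")
    seen.add(key)
    return problems
```

Amounts are parsed as `Decimal` rather than `float` so the net/gross comparison is not disturbed by binary rounding. Keeping these checks deterministic and separate from the LLM call means a failed rule can be reported to the accountant by name.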

Tech stack

React 19 · FastAPI · Python 3.11 · PostgreSQL 16 · Tesseract · PaddleOCR · Ollama · llava:13b

Outcome

Monthly bookkeeping prep drops from hours to minutes. The pipeline works on German receipts and invoices. The accountant receives a pre-categorized batch with confidence scores and a flag queue for anything the pipeline couldn't auto-resolve.
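One way to picture the split between the pre-categorized batch and the flag queue is the sketch below. It is an assumption-laden illustration: the `confidence` and `problems` keys and the 0.85 threshold are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field

AUTO_ACCEPT_THRESHOLD = 0.85  # hypothetical cut-off, not taken from the project


@dataclass
class Batch:
    accepted: list = field(default_factory=list)
    flag_queue: list = field(default_factory=list)


def route(items: list[dict], threshold: float = AUTO_ACCEPT_THRESHOLD) -> Batch:
    """Split extracted items into the export batch and the manual review queue."""
    batch = Batch()
    for item in items:
        # An item is auto-accepted only if confidence is high and no rule failed.
        if item["confidence"] >= threshold and not item.get("problems"):
            batch.accepted.append(item)
        else:
            batch.flag_queue.append(item)
    return batch
```

Accepted items keep their confidence score, so the accountant can still spot-check borderline entries in the exported batch.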