Immigration Document Automation — Demo

Sample Documents

🇪🇸

Spanish Passport

Pasaporte Español

Image quality 0.97

🇮🇹

NIE Certificate

Certificado de Registro UE

Image quality 0.83

🇮🇹

Birth Certificate

Acta de Nacimiento (trad.)

Image quality 0.64

📄

Select a document on the left and click Process Document to run the extraction pipeline.

How It Works

PDF / Image (Scan · Photo · Native PDF)

→

PyMuPDF + OCR (Tesseract fallback)

→

LegalDocument (DocType.identity_doc)

→

LLM Extractor (GPT-4o / Claude)

→

ConfidenceRouter (per-field thresholds)

→

ProvenanceTracker (SourceSpan)

→

AuditLogger (JSONL / PostgreSQL)

→

Output (CRM / PDF fill)

Document Loading

SourceAdapter handles ingestion from uploads, cloud storage, or watched folders. PDFLoader delegates native PDFs to PyMuPDF; scanned images fall back to Tesseract OCR. Each document receives a quality score, and further processing runs only when that score meets a configurable floor.
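The native-first, OCR-fallback flow with a quality gate might look roughly like the sketch below. It is an illustration, not the demo's actual PDFLoader: `load_document`, `LoadResult`, `QualityTooLow`, the 0.6 floor, and the 20-character heuristic are all assumptions. The two extractors are passed in as plain callables so the sketch runs without PyMuPDF or Tesseract installed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadResult:
    text: str
    method: str     # "native" or "ocr"
    quality: float  # image quality score in [0, 1]


class QualityTooLow(Exception):
    """Raised when a document's quality score is below the configured floor."""


def load_document(
    native_extract: Callable[[], str],
    ocr_extract: Callable[[], str],
    quality: float,
    quality_floor: float = 0.6,  # hypothetical default
    min_native_chars: int = 20,  # hypothetical "has a text layer" heuristic
) -> LoadResult:
    # Gate first: inputs below the floor never reach extraction.
    if quality < quality_floor:
        raise QualityTooLow(f"quality {quality:.2f} < floor {quality_floor:.2f}")
    # Try the embedded text layer (in practice, a wrapper around PyMuPDF).
    text = native_extract()
    if len(text.strip()) >= min_native_chars:
        return LoadResult(text, "native", quality)
    # Little or no embedded text: treat as a scan and fall back to OCR
    # (in practice, a wrapper around Tesseract).
    return LoadResult(ocr_extract(), "ocr", quality)


# A scanned image has no text layer, so the OCR callable is used.
scan = load_document(lambda: "", lambda: "CERTIFICADO DE REGISTRO", quality=0.83)
print(scan.method)  # ocr
```

Raising the floor to 0.7 would stop a 0.64-quality scan before either extractor is called, which is the point of gating ahead of extraction.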

LegalDocument Model

LegalDocument carries a DocType enum (identity_doc, government_form, statute, …), jurisdiction, hierarchy path, and a dictionary of extracted fields with per-field confidence scores and provenance references.
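As a rough illustration of that shape, the model could be expressed with dataclasses as below. Only the class names LegalDocument and DocType and the three listed enum members come from the description; `ExtractedField`, the attribute names, and the `unprovenanced` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class DocType(Enum):
    # Only the members named in the description; the real enum has more.
    identity_doc = "identity_doc"
    government_form = "government_form"
    statute = "statute"


@dataclass
class ExtractedField:  # hypothetical helper type
    value: Any
    confidence: float                     # 0.0 to 1.0
    provenance_ref: Optional[str] = None  # e.g. the ID of a source span


@dataclass
class LegalDocument:
    doc_type: DocType
    jurisdiction: str
    hierarchy_path: list[str] = field(default_factory=list)
    fields: dict[str, ExtractedField] = field(default_factory=dict)

    def unprovenanced(self) -> list[str]:
        """Names of fields that have no provenance reference yet."""
        return [n for n, f in self.fields.items() if f.provenance_ref is None]


# Placeholder values for illustration only.
doc = LegalDocument(
    doc_type=DocType.identity_doc,
    jurisdiction="ES",
    hierarchy_path=["identity", "passport"],
    fields={
        "surname": ExtractedField("GARCIA", 0.97, provenance_ref="span-1"),
        "passport_number": ExtractedField("XXX000000", 0.91),
    },
)
print(doc.unprovenanced())  # ['passport_number']
```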

LLM Field Extraction

A structured extraction prompt instructs the LLM to return field values as a typed JSON object. The prompt schema is defined per document type and is provider-agnostic: the same extraction schema works with GPT-4o, Claude, or a local Mistral model.
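One way to keep extraction provider-agnostic is to separate the schema and prompt from the model call, as in the following sketch. `PASSPORT_SCHEMA`, `build_prompt`, `parse_response`, and `extract` are hypothetical names, the prompt wording is invented, and the model is represented by any callable that maps a prompt string to a response string, so no vendor SDK is assumed.

```python
import json
from typing import Any, Callable

# Hypothetical per-document-type schema: field name -> expected Python type.
PASSPORT_SCHEMA: dict[str, type] = {
    "surname": str,
    "given_names": str,
    "passport_number": str,
    "date_of_birth": str,  # kept as an ISO 8601 string in this sketch
}


def build_prompt(schema: dict[str, type], document_text: str) -> str:
    wanted = {name: t.__name__ for name, t in schema.items()}
    return (
        "Extract the following fields from the document and reply with a "
        "single JSON object. Use null for any field you cannot find.\n"
        f"Fields and types: {json.dumps(wanted)}\n\n"
        f"Document:\n{document_text}"
    )


def parse_response(schema: dict[str, type], raw: str) -> dict[str, Any]:
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    result: dict[str, Any] = {}
    for name, expected in schema.items():
        value = data.get(name)  # missing keys become None
        if value is not None and not isinstance(value, expected):
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        result[name] = value    # keys outside the schema are dropped
    return result


def extract(
    schema: dict[str, type],
    document_text: str,
    complete: Callable[[str], str],  # any provider, wrapped as prompt -> text
) -> dict[str, Any]:
    return parse_response(schema, complete(build_prompt(schema, document_text)))


# Stand-in for a real model call, returning placeholder values.
def fake_llm(prompt: str) -> str:
    return '{"surname": "GARCIA", "given_names": "MARIA", "extra": 1}'


fields = extract(PASSPORT_SCHEMA, "PASAPORTE ...", fake_llm)
print(fields["surname"], fields["passport_number"])  # GARCIA None
```

Swapping providers then means swapping the `complete` callable; the schema, prompt, and validation stay the same.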

ConfidenceRouter

Routes each extracted field to AUTO_ACCEPT, HUMAN_REVIEW, or AUTO_REJECT based on configurable thresholds. High-stakes fields (passport_number, nie_number) carry tighter per-field overrides. The review queue is sorted by ascending confidence, so reviewers see the lowest-confidence items first.
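The routing rule can be sketched in a few lines. The three route names and the two high-stakes field names come from the description above; the `Thresholds` type, the method names, and every numeric threshold are illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "AUTO_ACCEPT"
    HUMAN_REVIEW = "HUMAN_REVIEW"
    AUTO_REJECT = "AUTO_REJECT"


@dataclass(frozen=True)
class Thresholds:  # hypothetical helper type
    accept: float  # confidence >= accept -> AUTO_ACCEPT
    reject: float  # confidence <  reject -> AUTO_REJECT


@dataclass
class ConfidenceRouter:
    # All numbers below are made up for illustration.
    default: Thresholds = field(default_factory=lambda: Thresholds(0.90, 0.40))
    overrides: dict[str, Thresholds] = field(
        default_factory=lambda: {
            "passport_number": Thresholds(0.98, 0.60),
            "nie_number": Thresholds(0.98, 0.60),
        }
    )

    def route(self, name: str, confidence: float) -> Route:
        t = self.overrides.get(name, self.default)
        if confidence >= t.accept:
            return Route.AUTO_ACCEPT
        if confidence < t.reject:
            return Route.AUTO_REJECT
        return Route.HUMAN_REVIEW

    def review_queue(self, scores: dict[str, float]) -> list[tuple[str, float]]:
        pending = [
            (name, conf)
            for name, conf in scores.items()
            if self.route(name, conf) is Route.HUMAN_REVIEW
        ]
        # Ascending confidence: the least certain field is reviewed first.
        return sorted(pending, key=lambda item: item[1])


router = ConfidenceRouter()
scores = {
    "surname": 0.95,
    "passport_number": 0.95,
    "date_of_birth": 0.70,
    "nie_number": 0.30,
}
print(router.review_queue(scores))
# [('date_of_birth', 0.7), ('passport_number', 0.95)]
```

Here the same 0.95 score is accepted for surname but sent to review for passport_number, which is the effect of a tighter per-field override.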

ProvenanceTracker

Links each extracted value back to its source chunk via a SourceSpan (chunk ID, page, character offsets, snippet). Exact substring search is tried first; a Jaccard token-overlap fallback handles OCR noise. Coverage reports identify fields with no provenance so that they can be prioritised for human review.
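The two-stage lookup can be sketched as below. The SourceSpan attributes follow the description; `Chunk`, `locate`, `missing_provenance`, the sliding-window strategy, and the 0.5 overlap threshold are assumptions chosen for this example rather than the tracker's actual behaviour.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceSpan:
    chunk_id: str
    page: int
    start: int  # character offsets within the chunk text
    end: int
    snippet: str


@dataclass(frozen=True)
class Chunk:  # hypothetical container for one chunk of source text
    chunk_id: str
    page: int
    text: str


def _tokens(s: str) -> set[str]:
    return set(re.findall(r"\w+", s.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    return len(ta & tb) / len(ta | tb) if ta and tb else 0.0


def locate(value: str, chunks: list[Chunk], min_overlap: float = 0.5) -> Optional[SourceSpan]:
    if not value.strip():
        return None
    # 1) Exact substring match.
    for c in chunks:
        i = c.text.find(value)
        if i != -1:
            return SourceSpan(c.chunk_id, c.page, i, i + len(value), value)
    # 2) Fallback: slide a window with the value's word count over each chunk
    #    and keep the window with the highest Jaccard token overlap.
    n = len(value.split())
    best, best_score = None, 0.0
    for c in chunks:
        words = list(re.finditer(r"\S+", c.text))
        for k in range(len(words) - n + 1):
            start, end = words[k].start(), words[k + n - 1].end()
            score = jaccard(value, c.text[start:end])
            if score >= min_overlap and score > best_score:
                best_score = score
                best = SourceSpan(c.chunk_id, c.page, start, end, c.text[start:end])
    return best


def missing_provenance(values: dict[str, str], chunks: list[Chunk]) -> list[str]:
    """Coverage check: names of fields that could not be located."""
    return [name for name, v in values.items() if locate(v, chunks) is None]


# OCR misread "GARCIA" as "GARC1A", so only the fallback finds the name.
chunks = [Chunk("c1", 1, "MARIA DEL CARMEN GARC1A Sexo: F Nacionalidad: ESP")]
span = locate("MARIA DEL CARMEN GARCIA", chunks)
print(span.snippet)  # MARIA DEL CARMEN GARC1A
```

Token-level Jaccard is coarse: a two-word value with one misread token overlaps by only 1/3, so short values may need a lower threshold or a character-level measure.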

Output Integrations

Once fields have been extracted, routed, and linked to their provenance, the data can be pushed directly to a CRM, used to fill PDF templates via a PDF automation layer, or returned as a typed API response. The AuditLogger records the full pipeline event trail, providing the processing logs needed for GDPR compliance.
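For the JSONL side of the audit trail, a minimal sketch could append one JSON object per pipeline event, as below. The event layout (timestamp, document_id, stage, details) and the `log` method are assumptions, and the PostgreSQL backend is not shown. The example logs field names and counts rather than extracted values, which is a choice made for this sketch only.

```python
import io
import json
from datetime import datetime, timezone
from typing import Any, TextIO


class AuditLogger:
    """Appends one JSON object per line (JSONL) to a text stream."""

    def __init__(self, stream: TextIO) -> None:
        self._stream = stream

    def log(self, document_id: str, stage: str, **details: Any) -> dict[str, Any]:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "document_id": document_id,
            "stage": stage,
            "details": details,
        }
        self._stream.write(json.dumps(event, ensure_ascii=False) + "\n")
        self._stream.flush()
        return event


# An in-memory buffer stands in for a log file opened in append mode.
buffer = io.StringIO()
audit = AuditLogger(buffer)
audit.log("doc-001", "extract", field_count=4)
audit.log("doc-001", "route", field="passport_number", route="HUMAN_REVIEW")

for line in buffer.getvalue().splitlines():
    print(json.loads(line)["stage"])
```

Because each line is an independent JSON object, the file can be tailed, filtered, or bulk-loaded into a database without parsing it as a whole.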