Immigration Document Automation

An end-to-end document processing pipeline for immigration law firms: OCR ingestion of passports, national IDs, and certificates; structured field extraction via LLM; per-field confidence scoring with automatic routing (auto-accept, human review, auto-reject); provenance tracking; and a full audit trail. Built on Contractex. Select a sample document below to see the pipeline in action.

Python · Contractex · OCR / PyMuPDF · LLM Extraction · ConfidenceRouter · ProvenanceTracker · AuditLogger · LegalDocument
Sample Documents
🇪🇸 Spanish Passport
Pasaporte Español
Image quality 0.97
🇮🇹 NIE Certificate
Certificado de Registro UE
Image quality 0.83
🇮🇹 Birth Certificate
Acta de Nacimiento (trad.)
Image quality 0.64

Select a document on the left and click Process Document to run the extraction pipeline.

How It Works

PDF / Image (Scan · Photo · Native PDF)
→ PyMuPDF + OCR (Tesseract fallback)
→ LegalDocument (DocType.identity_doc)
→ LLM Extractor (GPT-4o / Claude)
→ ConfidenceRouter (per-field thresholds)
→ ProvenanceTracker (SourceSpan)
→ AuditLogger (JSONL / PostgreSQL)
→ Output (CRM / PDF fill)

Document Loading

SourceAdapter handles ingestion from uploads, cloud storage, or watched folders. PDFLoader delegates native PDFs to PyMuPDF; scanned images fall back to Tesseract OCR. A quality score gates further processing: a document must meet a configurable quality floor before it proceeds to extraction.
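
The native-first, OCR-fallback dispatch and the quality gate can be sketched roughly as follows. This is an illustration under stated assumptions, not Contractex's API: the names `load_text` and `passes_quality_gate`, the `min_chars` heuristic, and the 0.70 floor are all hypothetical. The two extractors are passed in as plain callables so that a PyMuPDF reader and a Tesseract call could be plugged in without the sketch depending on either library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadedText:
    text: str
    method: str  # "native" or "ocr"


def load_text(
    path: str,
    native_extract: Callable[[str], str],
    ocr_extract: Callable[[str], str],
    min_chars: int = 20,  # assumed heuristic: less text than this means "scanned"
) -> LoadedText:
    """Try the native text layer first; fall back to OCR if it is nearly empty."""
    text = native_extract(path)
    if len(text.strip()) >= min_chars:
        return LoadedText(text=text, method="native")
    return LoadedText(text=ocr_extract(path), method="ocr")


def passes_quality_gate(quality: float, floor: float = 0.70) -> bool:
    """Return True if the quality score meets the configured floor (assumed 0.70)."""
    return quality >= floor


# Hypothetical stand-ins for a PyMuPDF text extractor and a Tesseract OCR call.
def fake_native(path: str) -> str:
    return ""  # a scanned image has no text layer


def fake_ocr(path: str) -> str:
    return "REINO DE ESPAÑA PASAPORTE"


loaded = load_text("scan.png", fake_native, fake_ocr)
print(loaded.method)              # ocr
print(passes_quality_gate(0.97))  # True
print(passes_quality_gate(0.64))  # False
```

With the assumed floor of 0.70, the sample passport (0.97) and NIE certificate (0.83) would proceed, while the birth certificate (0.64) would be held back.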

LegalDocument Model

LegalDocument carries a DocType enum (identity_doc, government_form, statute, …), jurisdiction, hierarchy path, and a dictionary of extracted fields with per-field confidence scores and provenance references.
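
A minimal sketch of what such a model might look like, using only the standard library. The attribute names, the `ExtractedField` helper, and the sample values are assumptions for illustration; only `LegalDocument`, `DocType`, and the three listed enum members come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DocType(Enum):
    # Only the members named in the text; the real enum has more.
    identity_doc = "identity_doc"
    government_form = "government_form"
    statute = "statute"


@dataclass
class ExtractedField:
    """Hypothetical container for one extracted value."""
    value: str
    confidence: float                     # 0.0 to 1.0
    provenance_ref: Optional[str] = None  # e.g. ID of a source span, if one was found


@dataclass
class LegalDocument:
    doc_type: DocType
    jurisdiction: str
    hierarchy_path: list[str] = field(default_factory=list)
    fields: dict[str, ExtractedField] = field(default_factory=dict)


doc = LegalDocument(
    doc_type=DocType.identity_doc,
    jurisdiction="ES",
    hierarchy_path=["identity", "passport"],
)
# Made-up sample value, not taken from any real document.
doc.fields["passport_number"] = ExtractedField("XAB123456", 0.98, "span-001")

print(doc.doc_type.value)                         # identity_doc
print(doc.fields["passport_number"].confidence)   # 0.98
```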

LLM Field Extraction

A structured extraction prompt instructs the LLM to return field values as a typed JSON object. The prompt schema is defined per document type and is provider-agnostic: the same extraction schema works with GPT-4o, Claude, or a local Mistral model.
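
One way to keep the schema provider-agnostic is to treat the model call as plain text in, plain text out, and put all structure into prompt rendering and response validation. The sketch below assumes a hypothetical `PASSPORT_SCHEMA` and helper names; the actual prompt wording and schema format used by the pipeline are not specified in the source.

```python
import json

# Hypothetical schema for one document type: field name -> expected JSON type.
PASSPORT_SCHEMA = {
    "surname": str,
    "given_names": str,
    "passport_number": str,
    "date_of_birth": str,  # assumed to be an ISO 8601 date string
}


def build_prompt(schema: dict[str, type], document_text: str) -> str:
    """Render a provider-neutral extraction prompt from the schema."""
    field_lines = "\n".join(f'- "{name}": {t.__name__}' for name, t in schema.items())
    return (
        "Extract the following fields from the document and reply with a single "
        "JSON object. Use null for any field you cannot find.\n"
        f"{field_lines}\n\nDocument:\n{document_text}"
    )


def parse_response(schema: dict[str, type], raw: str) -> dict:
    """Validate the model's reply against the schema; raise ValueError on mismatch."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    result = {}
    for name, expected in schema.items():
        value = data.get(name)
        if value is not None and not isinstance(value, expected):
            raise ValueError(f"{name}: expected {expected.__name__}")
        result[name] = value
    return result


# A made-up model reply; any provider's text output would be handled the same way.
reply = (
    '{"surname": "GARCIA", "given_names": "ANA", '
    '"passport_number": "XAB123456", "date_of_birth": null}'
)
fields = parse_response(PASSPORT_SCHEMA, reply)
print(fields["surname"])        # GARCIA
print(fields["date_of_birth"])  # None
```

Because validation happens after the call, swapping providers only changes the function that sends the prompt, not the schema or the parser.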

ConfidenceRouter

Routes each extracted field to AUTO_ACCEPT, HUMAN_REVIEW, or AUTO_REJECT based on configurable thresholds. High-stakes fields (passport_number, nie_number) carry tighter per-field overrides. The review queue is sorted ascending by confidence, so reviewers see the least confident items first.
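
The routing rule can be illustrated with a small sketch. The class below borrows the ConfidenceRouter name, but its internals, the `Thresholds` helper, and every numeric threshold are assumptions chosen for the example, not the pipeline's actual configuration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "AUTO_ACCEPT"
    HUMAN_REVIEW = "HUMAN_REVIEW"
    AUTO_REJECT = "AUTO_REJECT"


@dataclass
class Thresholds:
    accept: float  # confidence >= accept -> AUTO_ACCEPT
    reject: float  # confidence <  reject -> AUTO_REJECT


@dataclass
class ConfidenceRouter:
    # Assumed defaults; the real values are configurable and not given in the text.
    default: Thresholds = field(default_factory=lambda: Thresholds(0.90, 0.50))
    overrides: dict[str, Thresholds] = field(default_factory=dict)

    def route(self, name: str, confidence: float) -> Route:
        t = self.overrides.get(name, self.default)
        if confidence >= t.accept:
            return Route.AUTO_ACCEPT
        if confidence < t.reject:
            return Route.AUTO_REJECT
        return Route.HUMAN_REVIEW

    def review_queue(self, scores: dict[str, float]) -> list[tuple[str, float]]:
        """Fields routed to HUMAN_REVIEW, lowest confidence first."""
        pending = [
            (name, conf)
            for name, conf in scores.items()
            if self.route(name, conf) is Route.HUMAN_REVIEW
        ]
        return sorted(pending, key=lambda item: item[1])


router = ConfidenceRouter(
    overrides={
        # Tighter, hypothetical thresholds for high-stakes fields.
        "passport_number": Thresholds(0.98, 0.70),
        "nie_number": Thresholds(0.98, 0.70),
    }
)
scores = {
    "surname": 0.93,
    "passport_number": 0.93,
    "date_of_birth": 0.72,
    "place_of_birth": 0.41,
}
print(router.route("surname", 0.93).value)          # AUTO_ACCEPT
print(router.route("passport_number", 0.93).value)  # HUMAN_REVIEW
print(router.review_queue(scores))
# [('date_of_birth', 0.72), ('passport_number', 0.93)]
```

The same confidence of 0.93 is accepted for surname but sent to review for passport_number, which is the effect a per-field override is meant to have.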

ProvenanceTracker

Registers each extracted value back to its source chunk via SourceSpan (chunk ID, page, character offsets, snippet). Exact substring search is tried first; a Jaccard token-overlap fallback handles OCR noise. Coverage reports identify fields with no provenance for prioritised human review.
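
The two-step lookup (exact substring first, Jaccard token overlap as a fallback) might look like the sketch below. The `SourceSpan` attribute names follow the list above, but the `locate` function, the line-level granularity of the fallback, and the 0.5 overlap threshold are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSpan:
    chunk_id: str
    page: int
    start: int  # character offsets within the chunk
    end: int
    snippet: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def locate(
    value: str, chunk_id: str, page: int, chunk_text: str, min_overlap: float = 0.5
) -> Optional[SourceSpan]:
    # Step 1: exact substring match.
    start = chunk_text.find(value)
    if start != -1:
        end = start + len(value)
        return SourceSpan(chunk_id, page, start, end, chunk_text[start:end])

    # Step 2 (assumed granularity): pick the line with the best token overlap.
    best, best_score, offset = None, 0.0, 0
    for line in chunk_text.splitlines(keepends=True):
        content = line.rstrip("\r\n")
        score = jaccard(value, content)
        if score > best_score:
            best_score = score
            best = SourceSpan(chunk_id, page, offset, offset + len(content), content)
        offset += len(line)
    return best if best_score >= min_overlap else None


# Made-up OCR output in which "LOPEZ" was misread as "L0PEZ".
chunk = "PASAPORTE / PASSPORT\nGARCIA L0PEZ, ANA MARIA\nESP\n"

exact = locate("PASSPORT", "chunk-1", 1, chunk)
fuzzy = locate("GARCIA LOPEZ ANA MARIA", "chunk-1", 1, chunk)
missing = locate("1990-01-01", "chunk-1", 1, chunk)

print(exact.snippet)   # PASSPORT
print(fuzzy.snippet)   # GARCIA L0PEZ, ANA MARIA
print(missing)         # None
```

A value that returns `None` here is what a coverage report would flag as having no provenance.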

Output Integrations

Once fields have been extracted, routed, and linked to their provenance, the data can be pushed directly to a CRM, used to fill PDF templates via a PDF automation layer, or returned as a typed API response. The AuditLogger records the full pipeline event trail for GDPR-compliant processing logs.
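
As an illustration of the JSONL side of the audit trail, the sketch below appends one JSON object per pipeline event to a text stream. The class name `JsonlAuditLogger`, the event keys, and the sample events are hypothetical; the real AuditLogger's record format is not described in the source.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


class JsonlAuditLogger:
    """Append-only audit log: one JSON object per line."""

    def __init__(self, stream: TextIO):
        self._stream = stream

    def log(self, document_id: str, stage: str, **details) -> None:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "document_id": document_id,
            "stage": stage,
            "details": details,
        }
        self._stream.write(json.dumps(event, ensure_ascii=False) + "\n")
        self._stream.flush()


# An in-memory buffer stands in for a file opened in append mode.
buffer = io.StringIO()
audit = JsonlAuditLogger(buffer)
audit.log("doc-001", "ocr", method="tesseract", quality=0.83)
audit.log("doc-001", "routing", field="nie_number", route="HUMAN_REVIEW")

events = [json.loads(line) for line in buffer.getvalue().splitlines()]
print([e["stage"] for e in events])  # ['ocr', 'routing']
```

The sketch records field names and routing decisions rather than the extracted values themselves; whether values belong in the log is a policy choice the source does not address.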