An end-to-end document processing pipeline for immigration law firms: OCR ingestion of passports, national IDs, and certificates; structured field extraction via LLM; per-field confidence scoring with automatic routing (auto-accept, human review, auto-reject); provenance tracking; and a full audit trail. Built on Contractex. Select a sample document below to see the pipeline in action.
SourceAdapter handles ingestion from uploads, cloud storage, or watched folders. PDFLoader delegates native PDFs to PyMuPDF; scanned images fall back to Tesseract OCR. Quality scoring gates further processing on configurable confidence floors.
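The native-then-OCR fallback and the quality gate can be sketched roughly as follows. This is a minimal illustration, not the Contractex implementation: `native_extract` and `ocr_extract` are hypothetical callables standing in for the PyMuPDF and Tesseract steps, and `text_quality` and the `quality_floor` default are assumed placeholders for the configurable scoring.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoadResult:
    text: str
    method: str     # "native" or "ocr"
    quality: float  # 0.0 to 1.0


def text_quality(text: str) -> float:
    """Crude proxy (an assumption): share of alphanumeric or whitespace characters."""
    if not text:
        return 0.0
    good = sum(ch.isalnum() or ch.isspace() for ch in text)
    return good / len(text)


def load_document(
    path: str,
    native_extract: Callable[[str], str],
    ocr_extract: Callable[[str], str],
    quality_floor: float = 0.6,
) -> LoadResult:
    # Try the native text layer first; scanned files typically yield nothing here.
    text = native_extract(path)
    method = "native"
    if not text.strip():
        # No usable text layer, so fall back to OCR.
        text = ocr_extract(path)
        method = "ocr"
    quality = text_quality(text)
    if quality < quality_floor:
        # Gate: stop before extraction when the text is too noisy.
        raise ValueError(f"quality {quality:.2f} is below floor {quality_floor:.2f}")
    return LoadResult(text=text, method=method, quality=quality)
```

Passing the extractors in as callables keeps the gate testable without either library installed; in a real deployment they would wrap the actual PDF and OCR calls.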
LegalDocument carries a DocType enum (identity_doc, government_form, statute, …), jurisdiction, hierarchy path, and a dictionary of extracted fields with per-field confidence scores and provenance references.
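One way such a model could be laid out is shown below. Only the names LegalDocument and DocType and the three listed enum values come from the description; `ExtractedField`, the attribute names, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DocType(Enum):
    IDENTITY_DOC = "identity_doc"
    GOVERNMENT_FORM = "government_form"
    STATUTE = "statute"
    # The description lists further members that are not reproduced here.


@dataclass
class ExtractedField:
    # Hypothetical container for one extracted value.
    value: str
    confidence: float                     # 0.0 to 1.0
    provenance_ref: Optional[str] = None  # e.g. the ID of a source span


@dataclass
class LegalDocument:
    doc_type: DocType
    jurisdiction: str
    hierarchy_path: list[str]
    fields: dict[str, ExtractedField] = field(default_factory=dict)


# Usage with made-up sample values.
doc = LegalDocument(
    doc_type=DocType.IDENTITY_DOC,
    jurisdiction="ES",
    hierarchy_path=["identity", "passport"],
)
doc.fields["passport_number"] = ExtractedField("X1234567", 0.97, "span-001")
```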
A structured extraction prompt instructs the LLM to return field values as a typed JSON object. The prompt schema is defined per document type and is provider-agnostic: the same extraction schema works with GPT-4o, Claude, or a local Mistral model.
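A per-document-type schema and a provider-neutral prompt might look like the sketch below. The schema contents, the prompt wording, and the function names are all assumptions; the model call itself is left out, since any provider that returns the JSON string would slot in between the two functions.

```python
import json
from typing import Optional

# Hypothetical schema registry: one field map per document type.
SCHEMAS: dict[str, dict[str, str]] = {
    "identity_doc": {
        "surname": "string",
        "given_names": "string",
        "passport_number": "string",
        "date_of_birth": "string (YYYY-MM-DD)",
    },
}


def build_prompt(doc_type: str, document_text: str) -> str:
    """Build a prompt that contains nothing provider-specific."""
    schema = SCHEMAS[doc_type]
    return (
        "Extract the following fields from the document and reply with "
        "a single JSON object only.\n"
        f"Schema: {json.dumps(schema)}\n"
        "Use null for any field that is not present.\n"
        f"Document:\n{document_text}"
    )


def parse_response(doc_type: str, raw: str) -> dict[str, Optional[str]]:
    """Validate the model's reply against the schema for this document type."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    schema = SCHEMAS[doc_type]
    unknown = set(data) - set(schema)
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    # Missing fields are normalised to None so downstream code sees every key.
    return {name: data.get(name) for name in schema}
```

Keeping the schema as plain data is what makes the approach portable: only the transport to the model changes between providers.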
Routes each extracted field to AUTO_ACCEPT, HUMAN_REVIEW, or AUTO_REJECT based on configurable thresholds. High-stakes fields (passport_number, nie_number) carry tighter per-field overrides. The review queue is sorted ascending by confidence, so reviewers see the worst items first.
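The routing rule can be illustrated with a short sketch. The three route names and the two high-stakes field names come from the description; the threshold values, `route_field`, and `review_queue` are assumed for the example.

```python
from enum import Enum


class Route(Enum):
    AUTO_ACCEPT = "AUTO_ACCEPT"
    HUMAN_REVIEW = "HUMAN_REVIEW"
    AUTO_REJECT = "AUTO_REJECT"


# (accept_at, reject_below); the numbers are illustrative, not the product defaults.
DEFAULT_THRESHOLDS = (0.90, 0.50)
OVERRIDES = {
    "passport_number": (0.98, 0.70),
    "nie_number": (0.98, 0.70),
}


def route_field(name: str, confidence: float) -> Route:
    accept_at, reject_below = OVERRIDES.get(name, DEFAULT_THRESHOLDS)
    if confidence >= accept_at:
        return Route.AUTO_ACCEPT
    if confidence < reject_below:
        return Route.AUTO_REJECT
    return Route.HUMAN_REVIEW


def review_queue(confidences: dict[str, float]) -> list[tuple[str, float]]:
    """Return the fields needing review, lowest confidence first."""
    pending = [
        (name, conf)
        for name, conf in confidences.items()
        if route_field(name, conf) is Route.HUMAN_REVIEW
    ]
    return sorted(pending, key=lambda item: item[1])
```

With these assumed thresholds, a confidence of 0.93 auto-accepts a surname but sends a passport_number to review, which is the effect the tighter overrides are meant to have.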
Registers each extracted value back to its source chunk via SourceSpan (chunk ID, page, character offsets, snippet). Exact substring search is tried first; a Jaccard token-overlap fallback handles OCR noise. Coverage reports identify fields with no provenance for prioritised human review.
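The exact-match-then-Jaccard lookup could be sketched as below. The SourceSpan fields mirror the ones listed above; the word-window strategy, the `min_overlap` default, and the function names are assumptions chosen to keep the example small.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceSpan:
    chunk_id: str
    page: int
    start: int  # character offsets into the chunk text
    end: int
    snippet: str


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def locate(value: str, chunk_id: str, page: int, chunk_text: str,
           min_overlap: float = 0.5) -> Optional[SourceSpan]:
    # 1. Exact substring search.
    start = chunk_text.find(value)
    if start != -1:
        end = start + len(value)
        return SourceSpan(chunk_id, page, start, end, chunk_text[start:end])
    # 2. Fallback: slide a window with as many words as the value has
    #    and keep the window with the highest token overlap.
    words = list(re.finditer(r"\S+", chunk_text))
    n = max(1, len(value.split()))
    best, best_score = None, 0.0
    for i in range(len(words) - n + 1):
        s, e = words[i].start(), words[i + n - 1].end()
        score = jaccard(value, chunk_text[s:e])
        if score > best_score:
            best, best_score = (s, e), score
    if best is not None and best_score >= min_overlap:
        s, e = best
        return SourceSpan(chunk_id, page, s, e, chunk_text[s:e])
    return None  # no provenance; a coverage report would flag this field


def uncovered(spans: dict[str, Optional[SourceSpan]]) -> list[str]:
    """List the fields that ended up with no provenance."""
    return sorted(name for name, span in spans.items() if span is None)
```

Token overlap tolerates a misread word (for example `L0PEZ` for `LOPEZ`) as long as enough of the remaining tokens agree, but it cannot rescue a single-token value that OCR has corrupted.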
Extracted, routed, and provenance-linked data can be pushed directly to a CRM, used to fill PDF templates through a PDF automation layer, or returned as a typed API response. The AuditLogger records the full pipeline event trail for GDPR-compliant processing logs.
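As a rough idea of what an event trail can look like, here is a minimal in-memory sketch that serialises to JSON Lines. Only the class name AuditLogger comes from the description; the method names and event fields are assumptions, and the sketch does not by itself address storage, retention, or any other compliance requirement.

```python
import json
from datetime import datetime, timezone


class AuditLogger:
    """Minimal append-only event trail kept in memory (illustrative only)."""

    def __init__(self) -> None:
        self.events: list[dict] = []

    def record(self, document_id: str, stage: str, **details) -> dict:
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "document_id": document_id,
            "stage": stage,
            "details": details,
        }
        self.events.append(event)
        return event

    def to_jsonl(self) -> str:
        # One JSON object per line, in the order the events were recorded.
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


# Usage with made-up identifiers.
log = AuditLogger()
log.record("doc-001", "ingest", method="ocr")
log.record("doc-001", "route", field="passport_number", route="HUMAN_REVIEW")
```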