A RAG-powered chatbot that continuously ingests, indexes, and answers queries from trusted U.S. legal sources — Cornell LII, Justia, FindLaw, and United States Courts. Built on Contractex, an open-source legal AI library. Every response includes source citations and provenance-tracked chunks.
URLLoader: Fetches each source URL with a conditional GET based on the ETag and Last-Modified validators from the previous fetch, so unchanged pages are skipped entirely. Exponential-backoff retries handle transient failures. PDF URLs are automatically delegated to PyMuPDF.
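A minimal sketch of the conditional-GET and backoff behavior described above, using only the Python standard library. `CacheEntry`, `conditional_headers`, `backoff_delays`, and `fetch_if_changed` are hypothetical names chosen for illustration and are not Contractex's API. The retry count, base delay, and cap are assumptions as well.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    """Validators remembered from the previous successful fetch (hypothetical type)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(entry: CacheEntry) -> dict:
    """Build If-None-Match / If-Modified-Since from the stored validators."""
    headers = {}
    if entry.etag:
        headers["If-None-Match"] = entry.etag
    if entry.last_modified:
        headers["If-Modified-Since"] = entry.last_modified
    return headers


def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Delays of base * 2**attempt seconds, capped at `cap` (assumed values)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def fetch_if_changed(url, entry, retries=3,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return (body, new_entry), or (None, entry) when the server answers 304."""
    request = urllib.request.Request(url, headers=conditional_headers(entry))
    # One initial attempt plus `retries` retries; None marks the final attempt.
    for delay in backoff_delays(retries) + [None]:
        try:
            with opener(request, timeout=10) as response:
                new_entry = CacheEntry(response.headers.get("ETag"),
                                       response.headers.get("Last-Modified"))
                return response.read(), new_entry
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None, entry  # unchanged: skip re-processing
            if err.code < 500 or delay is None:
                raise  # client errors are not retried
        except urllib.error.URLError:
            if delay is None:
                raise
        sleep(delay)
```

`urllib` reports a 304 response as an `HTTPError`, which is why the "not modified" case is handled in the exception branch. The `opener` and `sleep` parameters exist only so the function can be exercised without network access.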
SemanticChunker: Splits legal text into 512-token windows with a 64-token overlap, preserving sentence boundaries. ClauseAwareChunker optionally segments by clause headers for improved precision on statutory and regulatory text.
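The windowing idea can be illustrated with a short sketch. It is not SemanticChunker's implementation: it approximates tokens with whitespace-separated words and uses a naive regex sentence splitter, both of which are assumptions. A real legal-text splitter would need to handle abbreviations such as "U.S.C." that this regex would split incorrectly.

```python
import re


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text, max_tokens=512, overlap=64):
    """Group whole sentences into chunks of at most `max_tokens` words,
    repeating up to `overlap` trailing words' worth of sentences in the next chunk.
    A single sentence longer than the budget is kept whole."""
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = len(sentence.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry over trailing sentences that fit inside the overlap budget.
            carried, carried_len = [], 0
            for prev in reversed(current):
                m = len(prev.split())
                if carried_len + m > overlap:
                    break
                carried.insert(0, prev)
                carried_len += m
            current, current_len = carried, carried_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the overlap is made of whole sentences, the repeated region can be shorter than 64 tokens. That is the cost of never cutting a sentence in half.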
pgvector: User queries are embedded and compared against stored vectors using cosine similarity. The top-k chunks are passed to the LLM with a system prompt that enforces informational-only responses and mandates citation of source URLs.
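To make the retrieval step concrete, here is a pure-Python sketch of cosine-similarity top-k ranking over an in-memory dictionary. It stands in for the database query and is only an illustration. The function names and the toy two-dimensional vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=3):
    """Return the k (similarity, chunk_id) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in stored.items()]
    return sorted(scored, reverse=True)[:k]


# Toy embeddings keyed by chunk id (illustrative values only).
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_k([1.0, 0.0], stored, k=2)
```

In pgvector itself the `<=>` operator returns cosine distance, so ordering by it in ascending order gives the same ranking as ordering by cosine similarity in descending order.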
ProvenanceTracker: Attaches each retrieved chunk to the response via SourceSpan records, preserving the exact page, character offsets, and source URL. Citations are verifiable against the content hash stored at ingestion time.
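One way to picture a verifiable citation is a small record that carries the location of the cited text plus a hash of the page it came from. The field names below, the choice of SHA-256, and the helper functions are assumptions for illustration. The actual SourceSpan schema may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical shape of a provenance record."""
    source_url: str
    page: int
    start: int          # character offset where the cited text begins
    end: int            # character offset just past the cited text
    content_hash: str   # hash of the full page text at ingestion time


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_span(url: str, page: int, page_text: str, start: int, end: int) -> SourceSpan:
    """Record where a chunk came from, plus a hash of the page it was cut from."""
    return SourceSpan(url, page, start, end, sha256_hex(page_text))


def verify_span(span: SourceSpan, page_text: str) -> Optional[str]:
    """Return the cited text if `page_text` still matches the ingestion hash,
    or None if the page has changed since the span was recorded."""
    if sha256_hex(page_text) != span.content_hash:
        return None
    return page_text[span.start:span.end]
```

Hashing the whole page rather than only the cited span means any edit to the page invalidates the citation, which is the conservative choice for legal text.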
A daily cron job re-runs URLLoader against every source. Content hashing means only modified pages trigger re-chunking and re-embedding, so chunking and embedding cost scales with actual changes to legal content, not with the number of page requests.
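The change-detection step could look roughly like the sketch below. It assumes SHA-256 over the raw response body and an in-memory dictionary of known hashes. The function names are hypothetical, and a real deployment would persist the hashes between runs.

```python
import hashlib


def content_hash(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()


def pages_to_reindex(fetched: dict, known_hashes: dict) -> list:
    """Return URLs whose body hash differs from the stored one,
    updating `known_hashes` in place as a side effect."""
    changed = []
    for url, body in fetched.items():
        digest = content_hash(body)
        if known_hashes.get(url) != digest:
            known_hashes[url] = digest
            changed.append(url)
    return changed


known = {}
first_run = pages_to_reindex({"u1": b"old text", "u2": b"stable"}, known)
second_run = pages_to_reindex({"u1": b"new text", "u2": b"stable"}, known)
```

On the first run every page is new, so both URLs are returned. On the second run only `u1` is returned, and only that page would be re-chunked and re-embedded.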
AuditLogger: Records every ingestion, retrieval, and generation event with a UUID, UTC timestamp, and field list. Backends: append-only JSONL for development, PostgreSQL for production. Backend failures are isolated and never interrupt response delivery.
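The failure-isolation property can be sketched with a small logger that writes to a list of backends and catches any exception a backend raises. The class below reuses the AuditLogger name for readability but is not the library's implementation. The record keys, the `JsonlBackend` class, and the reading of "field list" as a list of field names are all assumptions.

```python
import json
import logging
import uuid
from datetime import datetime, timezone

_log = logging.getLogger(__name__)


class JsonlBackend:
    """Append-only JSONL file: one JSON object per line."""

    def __init__(self, path):
        self.path = path

    def write(self, record):
        with open(self.path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(record) + "\n")


class AuditLogger:
    def __init__(self, backends):
        self.backends = list(backends)

    def log(self, event_type, fields):
        record = {
            "id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": event_type,      # e.g. "ingestion", "retrieval", "generation"
            "fields": sorted(fields),  # names of the fields involved (assumed meaning)
        }
        for backend in self.backends:
            try:
                backend.write(record)
            except Exception:
                # A failing backend is reported but never propagates to the caller.
                _log.exception("audit backend %r failed", backend)
        return record
```

Catching the exception per backend, rather than around the whole loop, means one broken backend does not prevent the remaining backends from receiving the record.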