Legal Information AI Chatbot

A RAG-powered chatbot that continuously ingests and indexes trusted U.S. legal sources (Cornell LII, Justia, FindLaw, and United States Courts) and answers queries from them. Built on Contractex, an open-source legal AI library. Every response includes source citations and provenance-tracked chunks.

Python · Contractex · RAG · pgvector · FastAPI · URLLoader · ProvenanceTracker · AuditLogger

Data Sources

Cornell Law School LII (law.cornell.edu): 12,847 indexed chunks

Justia (justia.com): 8,203 indexed chunks

FindLaw (findlaw.com): 15,621 indexed chunks

United States Courts (uscourts.gov): 3,892 indexed chunks

Contractex Pipeline

URLLoader

↓ ETag / Last-Modified detection

SemanticChunker

↓ 512-token sliding window

Embedder → pgvector

↓ text-embedding-3-small (1536-d)

LLM (GPT-4o)

↓ informational-only system prompt

ProvenanceTracker

↓ SourceSpan citation attachment

AuditLogger

Index Stats

Total chunks: 40,563

Embedding dims: 1,536

Avg retrieval: ~210 ms

Last full sync: Today 06:00 UTC

Ask a U.S. legal question

This chatbot retrieves information from 40,000+ indexed chunks of U.S. legal content and returns cited answers. Select a suggestion or type your own question.

How It Works

Data Sources: LII · Justia · FindLaw · US Courts

→

URLLoader: ETag / Last-Modified

→

HTML Stripper: stdlib parser

→

SemanticChunker: 512-token windows, 64-token overlap

→

Embedder: text-embedding-3-small

→

pgvector: cosine similarity

→

LLM: GPT-4o / Anthropic

→

ProvenanceTracker: SourceSpan citations

Ingestion — `URLLoader`

Fetches each source URL with conditional GET using ETag and Last-Modified headers — unchanged pages are skipped entirely. Exponential-backoff retry handles transient failures. PDF URLs are automatically delegated to PyMuPDF.
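The conditional-GET and retry behavior described above can be sketched with the standard library alone. This is an illustrative sketch, not Contractex's `URLLoader`: the function names, the retry count, and the 0.5-second base delay are assumptions, and PDF handling is left out.

```python
import time
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build the validators for a conditional GET from cached metadata."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def backoff_delays(retries, base=0.5, cap=30.0):
    """Exponential schedule: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


def fetch_if_changed(url, etag=None, last_modified=None, retries=3,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return (body, etag, last_modified), or None when the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified))
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            with opener(request, timeout=10) as response:
                return (response.read(),
                        response.headers.get("ETag"),
                        response.headers.get("Last-Modified"))
        except urllib.error.HTTPError as err:
            if err.code == 304:
                return None  # Not Modified: the page is skipped
            if err.code < 500 or attempt == retries:
                raise  # client errors are not retried
        except urllib.error.URLError:
            if attempt == retries:
                raise
        sleep(delays[attempt])  # transient failure: wait, then retry
```

`urllib` reports a 304 as an `HTTPError`, which is why the "unchanged" case is handled in the exception branch. The `opener` and `sleep` parameters exist only so the function can be exercised without network access.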

Chunking — `SemanticChunker`

Splits legal text into 512-token windows with 64-token overlap, preserving sentence boundaries. ClauseAwareChunker optionally segments by clause headers for improved precision on statutory and regulatory text.
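As a rough illustration of the windowing arithmetic, the sketch below slides a 512-token window forward by 448 tokens (512 minus the 64-token overlap). It is a simplification under stated assumptions: `sliding_windows` is a hypothetical helper, tokens are whatever list the caller passes in (for example `text.split()` as a stand-in for a real tokenizer), and the sentence-boundary and clause-header handling described above is not reproduced.

```python
def sliding_windows(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks
```

With these defaults, a 1,000-token document yields three chunks starting at tokens 0, 448, and 896, and the last 64 tokens of each chunk reappear at the start of the next.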

Retrieval — `pgvector`

User queries are embedded and compared against stored vectors using cosine similarity. Top-k chunks are passed to the LLM with a system prompt that enforces informational-only responses and mandates citation of source URLs.
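A minimal sketch of top-k retrieval by cosine similarity follows. The in-memory functions only illustrate the ranking; in production the comparison would run inside PostgreSQL, where pgvector's `<=>` operator returns cosine distance (1 minus similarity). The table name `chunks`, its columns, and the parameter names in the query string are assumptions, not the project's actual schema.

```python
import math

# Hypothetical schema: ordering by cosine distance returns the nearest chunks first.
PGVECTOR_QUERY = """
SELECT id, url, text, 1 - (embedding <=> %(query)s::vector) AS similarity
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s
"""


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k=5):
    """Rank (chunk_id, vector) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in chunks]
    scored.sort(reverse=True)
    return scored[:k]
```

For example, `top_k([1, 0], [("a", [1, 0]), ("b", [0, 1]), ("c", [1, 1])], k=2)` ranks `a` first and `c` second, because `b` is orthogonal to the query.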

Citations — `ProvenanceTracker`

Attaches each retrieved chunk to the response via SourceSpan records, preserving exact page, character offsets, and source URL. Citations are verifiable against the content hash stored at ingestion time.
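One way to picture a verifiable citation is a small record holding the URL, page, character offsets, and a hash of the page text taken at ingestion. The `SpanSketch` class and both helpers below are hypothetical stand-ins written for illustration; they are not Contractex's `SourceSpan`, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSketch:
    """Illustrative citation record (not the library's SourceSpan)."""
    url: str
    page: int
    start: int         # character offset where the cited text begins
    end: int           # character offset one past the cited text
    content_hash: str  # SHA-256 of the page text at ingestion time


def make_span(url, page, page_text, start, end):
    """Record where a chunk came from, plus a hash of the page it was cut from."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    return SpanSketch(url, page, start, end, digest)


def verify_span(span, page_text, quoted):
    """True only if the page is unchanged and the offsets still yield the quote."""
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    return digest == span.content_hash and page_text[span.start:span.end] == quoted
```

If the source page is edited after ingestion, the hash comparison fails and the citation can be flagged as stale rather than silently pointing at different text.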

Sync Scheduling

A daily cron job re-runs URLLoader against every source. Content hashing means only modified pages trigger re-chunking and re-embedding — infrastructure cost scales with actual legal content changes, not page requests.
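The hash comparison that decides which pages get re-chunked and re-embedded might look like the sketch below. The function names, the in-memory `stored_hashes` dictionary, and the use of SHA-256 are assumptions for illustration; a real deployment would presumably keep the hashes in the database.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a page's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def pages_to_reprocess(fetched, stored_hashes):
    """Return URLs whose text changed since the last sync, updating the store.

    `fetched` maps URL -> freshly extracted text; `stored_hashes` maps
    URL -> hash recorded during the previous run.
    """
    changed = []
    for url, text in fetched.items():
        digest = content_hash(text)
        if stored_hashes.get(url) != digest:
            stored_hashes[url] = digest
            changed.append(url)  # only these go on to chunking and embedding
    return changed
```

Running the function twice over identical content returns an empty list on the second pass, so a daily sync over unchanged pages triggers no embedding calls.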

Audit — `AuditLogger`

Records every ingestion, retrieval, and generation event with a UUID, a UTC timestamp, and a field list. Backends: append-only JSONL for development, PostgreSQL for production. Backend failures are isolated and never interrupt response delivery.
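A minimal sketch of the JSONL backend, assuming one JSON object per line and a logger that reports failure through its return value instead of raising. The record keys and function names here are hypothetical and not taken from Contractex's `AuditLogger`.

```python
import json
import uuid
from datetime import datetime, timezone


def audit_event(kind, fields):
    """Build one audit record: UUID, UTC timestamp, event kind, field names."""
    return {
        "id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": kind,            # e.g. "ingestion", "retrieval", "generation"
        "fields": sorted(fields),  # names of the fields touched, not their values
    }


def log_event(path, kind, fields):
    """Append one JSON line; return False on I/O failure instead of raising."""
    try:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(json.dumps(audit_event(kind, fields)) + "\n")
        return True
    except OSError:
        return False  # a broken audit backend must not block the response
```

Catching `OSError` at this boundary is what keeps a full disk or a missing directory from propagating into the request path; the caller can still inspect the return value and raise an alert separately.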
