Automatic Document Classifier with AI

Context

A law firm received hundreds of documents daily in various formats: scanned PDFs, certificate images, photocopied contracts, and invoices. Before any legal work could begin, a team of assistants had to sort and categorize each document by hand.

Problem

Manual triage consumed 4 to 6 hours of work per day, created bottlenecks at the start of cases, and produced inconsistent categorization under high volume. Misclassified documents delayed attorneys’ work and increased the risk of missing critical deadlines.

Solution

I built an AI pipeline that automatically processes each incoming document. The flow starts with OCR (image preprocessing followed by Tesseract) to extract raw text. An LLM (GPT-4o-mini, fine-tuned on the firm’s own documents) then classifies the document type and extracts key metadata: case number, dates, and parties involved.
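The two stages can be sketched as follows. This is a minimal illustration, not the production code: the label set, the JSON field names (doc_type, confidence, case_number, dates, parties), and the function names are all assumptions made for the example. The OCR step and the model call are passed in as plain callables, so the sketch runs without Tesseract or an API key.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical label set; the firm's real categories are not listed in the text.
ALLOWED_TYPES = {"contract", "invoice", "certificate", "court_order", "other"}


@dataclass
class Classification:
    doc_type: str
    confidence: float
    case_number: Optional[str] = None
    dates: List[str] = field(default_factory=list)
    parties: List[str] = field(default_factory=list)


def parse_classification(raw: str) -> Classification:
    """Validate the JSON string the model is assumed to return."""
    data = json.loads(raw)
    doc_type = data.get("doc_type", "other")
    if doc_type not in ALLOWED_TYPES:
        doc_type = "other"  # unknown labels fall back to a catch-all
    # Clamp confidence into [0, 1] so downstream thresholds behave predictably.
    confidence = min(max(float(data.get("confidence", 0.0)), 0.0), 1.0)
    return Classification(
        doc_type=doc_type,
        confidence=confidence,
        case_number=data.get("case_number"),
        dates=list(data.get("dates", [])),
        parties=list(data.get("parties", [])),
    )


def process_document(
    image: bytes,
    ocr: Callable[[bytes], str],       # e.g. a wrapper around a Tesseract call
    classify: Callable[[str], str],    # e.g. a wrapper around the LLM request
) -> Classification:
    """Run OCR, then classification, on one incoming document."""
    text = ocr(image)
    if not text.strip():
        # Nothing readable was extracted: return zero confidence
        # so the document cannot be auto-classified.
        return Classification(doc_type="other", confidence=0.0)
    return parse_classification(classify(text))
```

Keeping the OCR and model calls behind callables also makes each stage easy to replace or stub out in tests.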

I implemented confidence-based routing: when the model scores above 80% confidence, classification is automatic; below that threshold, the document enters a human review queue. This threshold was calibrated to minimize false positives on critical categories (such as court orders with deadlines). The extracted metadata is stored in PostgreSQL, and the original documents are kept in S3. A feedback loop collects the reviewers' corrections, which are used to retrain the model monthly.
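The routing rule and one possible way to calibrate its threshold can be sketched as below. Only the 80% figure comes from the project; the function names, the treatment of a score of exactly 0.80 (sent to review here), and the calibration procedure itself are assumptions for illustration. The calibration helper picks the lowest candidate threshold at which auto-accepted documents from a labeled validation set stay within a target error rate.

```python
from typing import Iterable, List, Tuple

AUTO_THRESHOLD = 0.80  # from the text; behavior at exactly 0.80 is an assumption

# One validation sample: (model confidence, whether the prediction was correct)
Sample = Tuple[float, bool]


def route(confidence: float, threshold: float = AUTO_THRESHOLD) -> str:
    """Return 'auto' when confidence is strictly above the threshold, else 'review'."""
    return "auto" if confidence > threshold else "review"


def auto_error_rate(samples: List[Sample], threshold: float) -> float:
    """Share of auto-accepted samples whose prediction was wrong."""
    accepted = [ok for conf, ok in samples if conf > threshold]
    if not accepted:
        return 0.0  # nothing auto-accepted, so nothing can be wrong
    return 1.0 - sum(accepted) / len(accepted)


def pick_threshold(
    samples: List[Sample], candidates: Iterable[float], max_error: float
) -> float:
    """Lowest candidate whose auto-accepted error rate is within max_error."""
    ordered = sorted(candidates)
    for t in ordered:
        if auto_error_rate(samples, t) <= max_error:
            return t
    return ordered[-1]  # no candidate qualifies: fall back to the strictest one


if __name__ == "__main__":
    # Made-up validation data, for illustration only.
    validation = [(0.95, True), (0.90, True), (0.85, True),
                  (0.75, False), (0.70, True), (0.60, False)]
    print(pick_threshold(validation, [0.5, 0.7, 0.8, 0.9], max_error=0.05))
```

For critical categories, the same calibration could be run on that category's samples alone with a stricter max_error, at the cost of sending more documents to review.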

Result

After two months in production, the system processes more than 5,000 documents per day at 94% accuracy, 8x faster than the manual process. Triage cost dropped by 60%, and the document misclassification rate fell from 12% to 2%. The assistant team now handles only low-confidence cases, which represent less than 6% of total volume.