mbox-parser

Python, ChromaDB, SQLite, AI, Claude
Side Project

A hybrid search engine for legal documents stored in MBOX format. It combines vector search (ChromaDB) with traditional full-text search (SQLite FTS5) and merges the two ranked result lists with Reciprocal Rank Fusion (RRF), so a query benefits from both semantic and exact-term matching.

How it works

Documents are parsed from MBOX files, chunked, and indexed in parallel:

  • ChromaDB stores vector embeddings for semantic search (GPU-accelerated)
  • SQLite FTS5 handles keyword and phrase matching
  • Reciprocal Rank Fusion merges both result sets into a single ranked list
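The keyword half of this pipeline can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the `chunks` table layout, and the character-window chunking with its `size` and `overlap` defaults are all assumptions, and the ChromaDB embedding step is left out. It also assumes the local SQLite build includes FTS5.

```python
import mailbox
import sqlite3


def iter_message_bodies(mbox_path):
    """Yield (message_id, subject, body_text) for each message in an MBOX file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            body = ""
            # Take the first text/plain part; HTML parts and attachments are skipped.
            for part in message.walk():
                if part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True) or b""
                    charset = part.get_content_charset() or "utf-8"
                    body = payload.decode(charset, errors="replace")
                    break
            yield message.get("Message-ID", f"msg-{index}"), message.get("Subject", ""), body
    finally:
        box.close()


def chunk_text(text, size=800, overlap=100):
    """Split text into overlapping character windows (hypothetical chunking scheme)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    starts = range(0, max(len(text) - overlap, 1), step)
    return [text[s:s + size] for s in starts if text[s:s + size].strip()]


def build_fts_index(conn, mbox_path):
    """Index every chunk of every message in an FTS5 table (assumed schema)."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
        "USING fts5(message_id UNINDEXED, subject, body)"
    )
    for message_id, subject, body in iter_message_bodies(mbox_path):
        rows = [(message_id, subject, chunk) for chunk in chunk_text(body)]
        conn.executemany(
            "INSERT INTO chunks (message_id, subject, body) VALUES (?, ?, ?)", rows
        )
    conn.commit()


def keyword_search(conn, query, limit=10):
    """Return (message_id, chunk) rows, best match first."""
    # bm25() yields lower values for better matches, so ascending order is best-first.
    return conn.execute(
        "SELECT message_id, body FROM chunks WHERE chunks MATCH ? "
        "ORDER BY bm25(chunks) LIMIT ?",
        (query, limit),
    ).fetchall()
```

In a full pipeline, the same chunks would also be embedded and written to a ChromaDB collection, giving each query one ranked list from each store.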

Claude integration lets you ask natural language questions about the document corpus and get answers grounded in the actual source material.
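One way to keep answers grounded is to pass the top-ranked chunks to the model together with the question and ask it to cite them. The sketch below only assembles such a prompt. The function name, the instruction wording, and the `max_chars` budget are hypothetical, and the source does not describe the project's real prompt or API call.

```python
def build_grounded_prompt(question, chunks, max_chars=6000):
    """Build a prompt from ranked (source_id, text) pairs, within a character budget."""
    parts = []
    used = 0
    for number, (source_id, text) in enumerate(chunks, start=1):
        block = f"[{number}] source: {source_id}\n{text.strip()}"
        # Stop adding excerpts once the (assumed) context budget is exhausted.
        if used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the excerpts below. "
        "Cite excerpts by their bracketed number, and say so if the excerpts "
        "do not contain the answer.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {question}"
    )
```

The returned string would be sent as the user message through whichever Claude client the application uses. Numbering the excerpts lets each claim in the answer be traced back to a specific message.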

Why hybrid search

Pure vector search tends to miss exact names and case numbers. Pure keyword search misses conceptual relationships. Fusing the two with Reciprocal Rank Fusion (RRF) covers both: ask "communications about the settlement timeline" and get results that match the concept even when the exact words differ, while still surfacing documents that mention specific dates and case numbers.
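The fusion step itself is small. In the standard RRF formulation, each document scores the sum of 1 / (k + rank) over every list it appears in, so a document ranked well by both engines beats one ranked first by only one. The sketch below is a minimal version: the function name and document ids are made up for illustration, and k = 60 is the commonly used default rather than a value confirmed by this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first lists of document ids into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent documents contribute nothing.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["email-7", "email-2", "email-9"]   # hypothetical semantic results
keyword_hits = ["email-2", "email-4", "email-7"]  # hypothetical FTS5 results
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['email-2', 'email-7', 'email-4', 'email-9']
```

Because RRF uses only rank positions, it needs no calibration between BM25 scores and embedding distances, which sit on unrelated scales.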