Local Hybrid Retrieval-Augmented Document QA
Paolo Astrino
TL;DR
To address data privacy in enterprise QA, the paper presents a fully local Retrieval-Augmented Generation system that runs on on-premises hardware without internet access. It uses a hybrid retrieval strategy that combines BM25 lexical matching with dense BGE embeddings through an EnsembleRetriever tuned to 30% sparse/70% dense, achieving competitive accuracy across legal, scientific, and conversational documents. GPU acceleration yields substantial speedups in embedding and inference, and a locally hosted Llama 3.2 served via Ollama ensures data sovereignty. Hallucination is quantified using an LLM-as-Judge on over 1,500 query-answer pairs. The results demonstrate that privacy-preserving enterprise document QA can match cloud-based performance while keeping all data on premises.
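To make the 30% sparse/70% dense weighting concrete, here is a minimal sketch of how two ranked result lists might be fused. It assumes weighted reciprocal rank fusion as the combination rule, which this summary does not state. The function name `weighted_rrf`, the smoothing constant `c=60`, and the document ids are hypothetical and chosen only for illustration. The sketch takes precomputed rankings as input and does not run BM25 or BGE itself.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion.

    rankings: list of lists, each ordered best-first (one list per retriever).
    weights:  one weight per retriever, e.g. [0.3, 0.7] for sparse/dense.
    c:        smoothing constant (assumed value; not taken from the paper).
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list,
            # scaled by how much that retriever is trusted.
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings: one from a lexical (BM25) retriever, one from a dense one.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
print(fused)
```

With these toy inputs, `doc_c` edges out `doc_a` because the dense list, which carries the larger weight, ranks it first. Documents found by only one retriever (`doc_b`, `doc_d`) still appear in the fused list, ordered by the weight of the retriever that returned them.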
Abstract
Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.
