Local Hybrid Retrieval-Augmented Document QA

Paolo Astrino

TL;DR

To address data privacy in enterprise QA, the paper presents a fully local Retrieval-Augmented Generation (RAG) system that runs on premises without internet access. Its hybrid retrieval strategy combines BM25 lexical matching with dense BGE embeddings through an EnsembleRetriever weighted 30% sparse / 70% dense, achieving competitive accuracy across legal, scientific, and conversational documents. GPU acceleration yields substantial speedups in embedding and inference, and a locally hosted Llama 3.2 served via Ollama keeps all data on site; hallucination is quantified with an LLM-as-Judge on over 1,500 query-answer pairs. The results indicate that privacy-preserving enterprise document QA can be competitive with cloud-based systems while keeping all data on premises.
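
To make the 30% sparse / 70% dense weighting concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, the rule LangChain's EnsembleRetriever is documented to use for combining ranked lists. The function name `weighted_rrf`, the toy document ids, and the constant `c=60` are illustrative assumptions; this summary does not state whether the paper's system uses the fusion rule unchanged.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids with weighted reciprocal rank fusion.

    rankings: list of ranked lists (best first), one per retriever.
    weights:  one weight per retriever, e.g. [0.3, 0.7] for sparse/dense.
    c:        smoothing constant (assumed value; 60 is a common default).
    """
    if len(rankings) != len(weights):
        raise ValueError("one weight per ranking is required")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes weight / (rank + c) to the document.
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings from a BM25 retriever and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d2", "d4"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.3, 0.7])
print(fused)  # ['d1', 'd2', 'd4', 'd3']
```

With these toy inputs, `d3` is ranked first by BM25 alone but falls to last after fusion, because the dense list carries 70% of the weight and does not contain it.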

Abstract

Organizations handling sensitive documents face a critical dilemma: adopt cloud-based AI systems that offer powerful question-answering capabilities but compromise data privacy, or maintain local processing that ensures security but delivers poor accuracy. We present a question-answering system that resolves this trade-off by combining semantic understanding with keyword precision, operating entirely on local infrastructure without internet access. Our approach demonstrates that organizations can achieve competitive accuracy on complex queries across legal, scientific, and conversational documents while keeping all data on their machines. By balancing two complementary retrieval strategies and using consumer-grade hardware acceleration, the system delivers reliable answers with minimal errors, letting banks, hospitals, and law firms adopt conversational document AI without transmitting proprietary information to external providers. This work establishes that privacy and performance need not be mutually exclusive in enterprise AI deployment.

Paper Structure

This paper contains 34 sections, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Architecture: frontend UI, client API, and server (retrieval, RAG core, secure credentials) with isolated secrets and local processing.
  • Figure 2: Reliability metrics: (a) low hallucination with high faithfulness/confidence; (b) distributions concentrated at 5 with modest degradation on MS MARCO.
  • Figure 3: Hybrid weight sensitivity across datasets. Each panel summarizes retrieval quality vs sparse weight (10% to 100%): composite plots include MRR, Recall@K, answer coverage, and rank / degradation curves. The 30% sparse / 70% dense configuration achieves near-optimal balance across all datasets; increasing sparsity causes sharp degradation for MS MARCO, gradual decline for SQuAD, and modest impact for Natural Questions.
  • Figure 4: Benchmark positioning: top three hybrid weights vs tier bands (MS MARCO normalized, SQuAD absolute). Chosen 30/70 mix sits solidly in Competitive while retaining acceptable MS MARCO performance.
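
The MRR and Recall@K metrics named in the Figure 3 caption can be sketched as follows, using their standard definitions. The helper names and the toy queries are hypothetical and are not taken from the paper; the paper's exact evaluation protocol (for example, its choice of K) is not given in this summary.

```python
def reciprocal_rank(ranking, relevant):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranking, relevant, k):
    """Return the fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranking[:k]) & set(relevant))
    return hits / len(relevant)


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical evaluation set: (retrieved ranking, set of relevant doc ids).
runs = [
    (["d1", "d2", "d3"], {"d2"}),        # first relevant hit at rank 2
    (["d4", "d5", "d6"], {"d4", "d6"}),  # first relevant hit at rank 1
    (["d7", "d8", "d9"], {"d0"}),        # relevant document never retrieved
]

mrr = mean([reciprocal_rank(r, rel) for r, rel in runs])
recall_at_2 = mean([recall_at_k(r, rel, k=2) for r, rel in runs])
print(mrr, recall_at_2)  # 0.5 0.5
```

A weight sensitivity study like the one in Figure 3 would recompute these metrics once per sparse weight setting and compare the resulting curves.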