Hybrid Retrieval for Hallucination Mitigation in Large Language Models: A Comparative Analysis

Chandana Sree Mala, Gizem Gezici, Fosca Giannotti

TL;DR

The paper addresses hallucination in large language models by leveraging Retrieval-Augmented Generation (RAG) with three retriever types: sparse, dense, and hybrid. It introduces a hybrid retriever (Hyb-RRF) that expands queries via WordNet, weights sparse and dense signals dynamically, and fuses them with Reciprocal Rank Fusion. Retrieval is evaluated on the HaluBench dataset using $MAP@3$ and $NDCG@3$. The hybrid approach achieves $MAP@3=0.897$ and $NDCG@3=0.915$, and delivers the highest downstream accuracy on fails together with the lowest hallucination and rejection rates: $Accuracy=80.41\%$, $Hallucination Rate=9.38\%$, and $Rejection Rate=10.36\%$. Across six source datasets, Hyb-RRF consistently improves grounding and reduces hallucinations, though domain-specific challenges remain. This work highlights the practical impact of integrated retrieval design on LLM reliability in real-world QA settings.
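To make the two retrieval metrics concrete, the following is a minimal sketch of how $MAP@3$ and $NDCG@3$ could be computed from binary relevance judgments. The function names and the normalisation choices are assumptions made for illustration (average precision is divided by the number of relevant documents found in the top $k$, and the ideal ranking is taken from the judged list itself); the paper's exact definitions may differ.

```python
import math

def average_precision_at_k(relevance, k=3):
    """AP@k for one query; `relevance` is a ranked list of 0/1 judgments.

    Assumption: normalised by the number of relevant hits inside the top k.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this cut-off
    return total / hits if hits else 0.0

def ndcg_at_k(relevance, k=3):
    """NDCG@k for one query, using the log2 rank discount."""
    def dcg(rels):
        return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, start=1))
    # Assumption: the ideal ordering is the judged list sorted by relevance.
    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0

def mean_over_queries(metric, runs, k=3):
    """Average a per-query metric over all queries (for example, MAP@k)."""
    return sum(metric(r, k) for r in runs) / len(runs)

# Hypothetical judgments for two queries, top three documents each.
runs = [[1, 0, 0], [0, 1, 0]]
map_at_3 = mean_over_queries(average_precision_at_k, runs)
ndcg_at_3 = mean_over_queries(ndcg_at_k, runs)
```

Both metrics reward placing a relevant document early: moving the only relevant document from rank 1 to rank 2 halves its average precision and lowers its NDCG by the log discount.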

Abstract

Large Language Models (LLMs) excel in language comprehension and generation but are prone to hallucinations, producing factually incorrect or unsupported outputs. Retrieval Augmented Generation (RAG) systems address this issue by grounding LLM responses with external knowledge. This study evaluates the relationship between retriever effectiveness and hallucination reduction in LLMs using three retrieval approaches: sparse retrieval based on BM25 keyword search, dense retrieval using semantic search with Sentence Transformers, and a proposed hybrid retrieval module. The hybrid module incorporates query expansion and combines the results of sparse and dense retrievers through a dynamically weighted Reciprocal Rank Fusion score. Using the HaluBench dataset, a benchmark for hallucinations in question answering tasks, we assess retrieval performance with metrics such as mean average precision and normalised discounted cumulative gain, focusing on the relevance of the top three retrieved documents. Results show that the hybrid retriever achieves better relevance scores, outperforming both sparse and dense retrievers. Further evaluation of LLM-generated answers against ground truth using metrics such as accuracy, hallucination rate, and rejection rate reveals that the hybrid retriever achieves the highest accuracy on fails, the lowest hallucination rate, and the lowest rejection rate. These findings highlight the hybrid retriever's ability to enhance retrieval relevance, reduce hallucination rates, and improve LLM reliability, emphasising the importance of advanced retrieval techniques in mitigating hallucinations and improving response accuracy.
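The fusion step described in the abstract can be illustrated with a small sketch of weighted Reciprocal Rank Fusion. This is not the authors' implementation: the function name `weighted_rrf`, the fixed weight `w_sparse`, and the smoothing constant `k=60` (the value conventionally used for RRF) are assumptions, and the paper's dynamic weighting rule and WordNet query expansion are not reproduced here.

```python
from collections import defaultdict

def weighted_rrf(sparse_ranking, dense_ranking, w_sparse=0.5, k=60, top_n=3):
    """Fuse two ranked lists of document ids with weighted RRF.

    Each document scores weight / (k + rank) in every list it appears in.
    Assumption: `w_sparse` is a fixed stand-in for the paper's dynamic weight.
    """
    w_dense = 1.0 - w_sparse
    scores = defaultdict(float)
    for weight, ranking in ((w_sparse, sparse_ranking), (w_dense, dense_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

# Hypothetical rankings from a BM25 retriever and a dense retriever.
sparse = ["a", "b", "c"]
dense = ["b", "d", "a"]
fused = weighted_rrf(sparse, dense)
```

Because RRF works on ranks rather than raw scores, the BM25 and embedding-similarity scales never need to be calibrated against each other; the weight only shifts how much each ranking contributes.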

Paper Structure

This paper contains 20 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Our Hybrid Retriever Pipeline
  • Figure 2: Generation phase
  • Figure 3: Overall Performance in Mitigating Hallucinations on $HaluBench_{small}$
  • Figure 4: Metrics comparison on only Hallucinated Samples
  • Figure 5: This is an example caption for the image.
  • ...and 2 more figures