ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng, Yang Song, Jun Xu, Xiao Zhang, Weijie Yu, Yang Song, Han Li

TL;DR

This work investigates RAG hallucinations by disentangling the LLM's use of external context from its parametric knowledge using mechanistic interpretability. It proposes ReDeEP to detect hallucinations via decoupled External Context Score $\mathcal{E}$ and Parametric Knowledge Score $\mathcal{P}$, and demonstrates that hallucinations arise from underutilization of external context by Copying Heads and overreliance on parametric knowledge by Knowledge FFNs. AARF provides a mitigation by reweighting Copying Heads and Knowledge FFNs without updating model parameters. The approach yields significant gains on RAGTruth and Dolly (AC) datasets across multiple LLaMA backbones and improves truthfulness as validated by GPT-4o evaluations, offering a practical, efficient path to reliable retrieval-augmented systems.
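The detection idea above can be illustrated with a toy example. This is a minimal sketch, not the paper's formulation: it assumes the two scores are already available as plain numbers, and the linear combination, the weights `w_ext` and `w_par`, the threshold, and all names (`ResponseScores`, `hallucination_score`, `is_hallucination`) are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ResponseScores:
    """Scores for one generated response (illustrative values, not from the paper)."""
    external_context: float      # stands in for the External Context Score E
    parametric_knowledge: float  # stands in for the Parametric Knowledge Score P


def hallucination_score(s: ResponseScores, w_ext: float = 1.0, w_par: float = 1.0) -> float:
    """Grow with reliance on parametric knowledge, shrink with use of external context.

    The linear form and the default weights are assumptions of this sketch.
    """
    return w_par * s.parametric_knowledge - w_ext * s.external_context


def is_hallucination(s: ResponseScores, threshold: float = 0.0) -> bool:
    """Flag a response whose combined score exceeds a (hypothetical) threshold."""
    return hallucination_score(s) > threshold


# A response that leans on the retrieved context vs. one that ignores it.
grounded = ResponseScores(external_context=0.8, parametric_knowledge=0.2)
ungrounded = ResponseScores(external_context=0.1, parametric_knowledge=0.7)

print(is_hallucination(grounded))
print(is_hallucination(ungrounded))
```

Keeping the two scores as separate inputs, rather than a single confidence value, reflects the decoupling the summary describes: a low External Context Score and a high Parametric Knowledge Score each push the combined score upward independently.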

Abstract

Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) utilize external and parametric knowledge. Current detection methods often focus on only one of these mechanisms or fail to decouple their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover that hallucinations occur when the Knowledge FFNs in LLMs overemphasize parametric knowledge in the residual stream, while Copying Heads fail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we propose ReDeEP, a novel method that detects hallucinations by decoupling the LLM's utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.
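To make "modulating the contributions" concrete, the sketch below rescales what attention heads and an FFN add to a residual stream while leaving the model weights untouched. It is a simplified illustration under stated assumptions, not the AARF procedure itself: a real transformer layer applies attention and the FFN sequentially with normalization, whereas this toy treats them as one additive update. The function name `reweighted_residual_update` and the scale values are hypothetical.

```python
from typing import List, Sequence


def reweighted_residual_update(
    residual: Sequence[float],
    head_outputs: Sequence[Sequence[float]],
    ffn_output: Sequence[float],
    head_scales: Sequence[float],
    ffn_scale: float,
) -> List[float]:
    """Return residual + sum_h head_scales[h] * head_outputs[h] + ffn_scale * ffn_output.

    Only the contributions are rescaled; no model parameter is modified.
    """
    if len(head_outputs) != len(head_scales):
        raise ValueError("need exactly one scale per attention head")
    updated = list(residual)
    for scale, head in zip(head_scales, head_outputs):
        for i, value in enumerate(head):
            updated[i] += scale * value
    for i, value in enumerate(ffn_output):
        updated[i] += ffn_scale * value
    return updated


# Toy 2-dimensional residual stream with two heads and one FFN.
residual = [0.0, 0.0]
head_outputs = [[1.0, 0.0], [0.0, 1.0]]  # head 0 is treated as a copying head here
ffn_output = [2.0, 2.0]                  # treated as a knowledge FFN here

# Unmodified update: every contribution has weight 1.
baseline = reweighted_residual_update(residual, head_outputs, ffn_output, [1.0, 1.0], 1.0)

# Amplify the assumed copying head and dampen the assumed knowledge FFN.
reweighted = reweighted_residual_update(residual, head_outputs, ffn_output, [2.0, 1.0], 0.5)

print(baseline)    # [3.0, 3.0]
print(reweighted)  # [3.0, 2.0]
```

Because the scales are applied to activations at inference time, an intervention of this kind can be switched on or off per request, which is consistent with the summary's point that the mitigation works without updating model parameters.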

Paper Structure

This paper contains 27 sections, 22 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Two examples of RAG where the retrieved document is correct but conflicts with parametric knowledge. The left example shows a correct response based on external knowledge, while the right example demonstrates hallucination despite accurate external context.
  • Figure 2: Causal perspectives on hallucination detection methods. (i): parametric knowledge is confounded by external context, (ii): external context is confounded by parametric knowledge, and (iii): mixes both without decoupling their contributions. (Ours): decouple these confounders using mechanistic interpretability, incorporating them as covariates to improve hallucination detection.
  • Figure 3: Expanded views of Unrolled LLMs' Attention and FFN blocks. (a): The calculation process of the External Context Score and Parametric Knowledge Score. (b): Example of intervening on attention heads. (c): Example of intervening on FFN modules.
  • Figure 4: Relationship Between LLM Utilization of External Context, Parametric Knowledge, and Hallucinations. Top shows the internal mechanism of the LLM's utilization of external context and the occurrence of hallucinations, where the Pearson correlation coefficient between (c) and (a) is 0.41, and between (c) and (b) is 0.46, indicating correlations among them. Bottom illustrates the internal mechanism of the LLM's utilization of parametric knowledge and the occurrence of hallucinations, where (d) is scaled by $10^{7}$.
  • Figure 5: (Left) Intervention Result for Attention Heads and FFNs. (Right) External Context Scores and Parametric Knowledge Scores (scaled by $10^{5}$) comparing Truth & Known (where the LLM knows the truthful answer) and Hallucination (where the LLM does not know the answer and hallucinates).
  • ...and 2 more figures