Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders

Guangzhi Xiong, Zhenghao He, Bohan Liu, Sanchit Sinha, Aidong Zhang

TL;DR

RAGLens leverages sparse autoencoders to extract interpretable internal features from LLM activations that correlate with RAG hallucinations. By selecting informative features through mutual information and modeling them with a Generalized Additive Model, it delivers accurate, explanation-friendly detection and supports post-hoc mitigation. Across diverse datasets, models, and domains, RAGLens outperforms baselines, shows cross-model transferability, and provides both local and global interpretability to guide faithful generation. This approach highlights the value of mechanistic interpretability for improving trustworthiness in retrieval-augmented systems.

Abstract

Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
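The information-based feature selection step mentioned above can be illustrated with a small, standard-library sketch. It is not taken from the RAGLens repository: the function names (`mutual_information`, `max_pool_binarize`, `select_top_k`), the binarization `threshold`, and the toy activations are all hypothetical, and binarizing max-pooled activations is an assumption made here for simplicity. The additive modeling stage is omitted.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def max_pool_binarize(token_activations, threshold=0.0):
    """Max-pool each feature over tokens, then binarize (threshold is an assumption)."""
    # token_activations: list of tokens, each a list of per-feature activations
    return [int(max(column) > threshold) for column in zip(*token_activations)]


def select_top_k(examples, labels, k):
    """Rank features by mutual information with the label; return top-k indices and all scores."""
    pooled = [max_pool_binarize(example) for example in examples]
    n_features = len(pooled[0])
    scores = [
        mutual_information([row[j] for row in pooled], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): 4 responses, 3 SAE features, variable token counts.
# Feature 0 fires only in hallucinated responses, feature 1 always, feature 2 never.
examples = [
    [[0.0, 0.5, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 0.3, 0.0], [0.0, 0.2, 0.0]],
    [[0.7, 0.4, 0.0]],
    [[0.0, 0.6, 0.0], [0.0, 0.0, 0.0], [0.0, 0.1, 0.0]],
]
labels = [1, 0, 1, 0]  # 1 = hallucinated, 0 = faithful

top_features, scores = select_top_k(examples, labels, k=1)
print(top_features, scores)
```

In this toy setting only feature 0 carries information about the label, so it is ranked first, while features that are constant across responses receive a score of zero.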

Paper Structure

This paper contains 47 sections, 2 theorems, 31 equations, 10 figures, and 13 tables.

Key Result

Theorem 1

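If $T\times \bar{p} \ll 1$ with $\bar{p}=\tfrac{1}{2}(p_1+p_0)$, then the mutual information $I(\bar{z};\ell)$ between the max-pooled feature $\bar{z}$ and the label $\ell$ is governed by a leading-order term that is linear in $T$ and quadratic in $\Delta p = p_1 - p_0$, and $I(\bar{z};\ell)>0$ iff $p_1\neq p_0$.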
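The scaling stated in Theorem 1 can be checked numerically. The sketch below rests on assumptions that are not confirmed by the text above: each of the $T$ tokens activates the feature independently with probability $p_1$ (hallucinated) or $p_0$ (faithful), the two labels are equally likely, and $|p_1-p_0|$ is small relative to $\bar{p}$. Under these assumptions a second-order expansion of the binary entropy gives $I(\bar{z};\ell)\approx T(p_1-p_0)^2/(8\bar{p})$ in nats. The constant $1/(8\bar{p})$ comes from that expansion and may differ from the expression in the paper. The function names are hypothetical.

```python
import math


def binary_entropy(q):
    """Entropy (in nats) of a Bernoulli(q) variable."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)


def max_pool_mi(p1, p0, T):
    """Exact MI between a max-pooled binary feature and a balanced binary label.

    Assumes i.i.d. Bernoulli token activations, so the pooled feature fires
    with probability 1 - (1 - p)^T under each label.
    """
    q1 = 1.0 - (1.0 - p1) ** T
    q0 = 1.0 - (1.0 - p0) ** T
    q_bar = 0.5 * (q1 + q0)
    # I(z; l) = H(z) - H(z | l) with equal label priors
    return binary_entropy(q_bar) - 0.5 * (binary_entropy(q1) + binary_entropy(q0))


def leading_order(p1, p0, T):
    """Second-order approximation T * (p1 - p0)^2 / (8 * p_bar); an assumption of this sketch."""
    p_bar = 0.5 * (p1 + p0)
    return T * (p1 - p0) ** 2 / (8.0 * p_bar)


# Sparse regime: T * p_bar = 1e-3, well below 1.
p1, p0, T = 1.1e-4, 0.9e-4, 10
exact = max_pool_mi(p1, p0, T)
approx = leading_order(p1, p0, T)
print("exact / approx:", exact / approx)
print("doubling T:", max_pool_mi(p1, p0, 2 * T) / exact)
print("doubling delta p:", max_pool_mi(1.2e-4, 0.8e-4, T) / exact)
```

If the assumptions hold, the first ratio should be close to 1, doubling $T$ should roughly double the mutual information, and doubling $\Delta p$ at fixed $\bar{p}$ should roughly quadruple it.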

Figures (10)

  • Figure 1: Overview of RAGLens for detecting, explaining, and mitigating hallucinations in retrieval-augmented generation using interpretable sparse features.
  • Figure 2: Comparison of LLM CoT-style self-judgment versus internal knowledge revealed by SAE features for hallucination detection across datasets.
  • Figure 3: Layer-wise analysis of Llama3.2-1B, Llama3-8B, Qwen3-0.6B, and Qwen3-4B on subtasks in RAGTruth (RAGTruth-Summary, RAGTruth-QA, and RAGTruth-Data2txt).
  • Figure 4: Effect of varying the number of selected features ($K'$) on hallucination detection performance, comparing mutual information (MI) ranking and random selection (Rand.).
  • Figure 5: Comparison of logistic regression (LR), generalized additive model (GAM), multilayer perceptron (MLP), and eXtreme Gradient Boosting (XGBoost) as predictors for RAGLens, evaluated across multiple models and datasets.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 1: Max pooling in the sparse-activation regime
  • Proof: Proof sketch
  • Theorem 2: Restatement of Theorem 1
  • Proof