The HalluRAG Dataset: Detecting Closed-Domain Hallucinations in RAG Applications Using an LLM's Internal States

Fabian Ridder, Malte Schilling

TL;DR

The paper presents HalluRAG, a dataset for studying closed-domain hallucinations in retrieval-augmented generation (RAG). It uses recency to ensure that the queried information emerged after the model's training cut-off date and so was not seen during training. MLP classifiers are trained on LLM internal states, namely contextualized embedding vectors (CEVs) and intermediate activation values (IAVs), to detect hallucinations at sentence level; labels are generated automatically by GPT-4o using a four-boolean grounding scheme. Depending on the model and quantization, test accuracies reach up to 75 %, and training separate classifiers for answerable and unanswerable prompts improves accuracy. Generalization to other datasets is limited, which underscores the need for more diverse data. Overall, HalluRAG indicates that internal representations carry usable signals for hallucination detection in RAG systems, but practical deployment requires broader datasets and robust cross-domain evaluation.

Abstract

Detecting hallucinations in large language models (LLMs) is critical for enhancing their reliability and trustworthiness. Most research focuses on hallucinations as deviations from information seen during training. However, the opaque nature of an LLM's parametric knowledge complicates the understanding of why generated texts appear ungrounded: the LLM might not have picked up the necessary knowledge from large and often inaccessible datasets, or the information might have been changed or contradicted during further training. Our focus is on hallucinations involving information not used in training, which we determine by using recency to ensure the information emerged after a cut-off date. This study investigates these hallucinations by detecting them at sentence level using different internal states of various LLMs. We present HalluRAG, a dataset designed to train classifiers on these hallucinations. Depending on the model and quantization, MLPs trained on HalluRAG detect hallucinations with test accuracies of up to 75 %, with Mistral-7B-Instruct-v0.1 achieving the highest test accuracies. Our results show that IAVs detect hallucinations as effectively as CEVs and reveal that answerable and unanswerable prompts are encoded differently, as separate classifiers for these categories improved accuracy. However, HalluRAG showed limited generalizability, which calls for more diversity in datasets on hallucinations.
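
The recency criterion described in the abstract can be illustrated with a minimal sketch. It assumes each passage carries a known date on which its information first appeared; the `Passage` type, the `created` field, the function names, and the cut-off date are hypothetical placeholders and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Passage:
    """Hypothetical container for a candidate source passage."""
    title: str
    text: str
    created: date  # assumed: date the information first appeared


def is_unseen(passage: Passage, training_cutoff: date) -> bool:
    """True if the passage emerged strictly after the training cut-off."""
    return passage.created > training_cutoff


def filter_unseen(passages: list[Passage], training_cutoff: date) -> list[Passage]:
    """Keep only passages the model could not have seen during training."""
    return [p for p in passages if is_unseen(p, training_cutoff)]


# Placeholder cut-off date, chosen only for illustration.
cutoff = date(2023, 9, 30)
candidates = [
    Passage("Old event", "Happened long ago.", date(2021, 5, 1)),
    Passage("Recent event", "Happened after the cut-off.", date(2024, 2, 15)),
    Passage("Boundary event", "Dated exactly at the cut-off.", date(2023, 9, 30)),
]
unseen = filter_unseen(candidates, cutoff)
print([p.title for p in unseen])
```

The strict comparison excludes passages dated exactly at the cut-off, which is the conservative choice when the exact boundary of the training data is uncertain.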

Paper Structure

This paper contains 18 sections, 3 figures, 16 tables.

Figures (3)

  • Figure 1: Differentiation of approaches (azaria2023internal; su2024unsupervised; longpre2022entitybased) based on the type of queried knowledge. Current methods in the literature focus on knowledge that is assumed to be encoded in the LLM's parameters (parametric). This is often difficult to assess: first, the training data is not always accessible. Second, it is not clear whether and how accurately this information was learned by the model during training, for example, which level of detail was kept, or how further training on other data might have influenced the information, for example, when contradicting pieces of information were present in the training data. In contrast, our focus (in the HalluRAG dataset) is on knowledge the LLM could not have seen during training, avoiding speculative assumptions. This gives us full control over offering this knowledge as context to the model or posing questions the model cannot answer in any case. The second dimension distinguishes whether relevant information for answering the question was provided as part of the context.
  • Figure 2: Locations of intermediate activation values and contextualized embedding vectors in the simplified architecture of LLaMA-2-7B, including RMSNorm (zhang2019root) and Rotary Position Embeddings (su2023roformer). While azaria2023internal and su2024unsupervised used contextualized embedding vectors as input to a binary classifier, we extend this approach by also considering intermediate activation values as classifier inputs.
  • Figure 3: Overview of the process flow for setting up the HalluRAG dataset. Shown is how a valid Wikipedia passage is turned into a RAG prompt, the corresponding generated sentences, and the internal states that are stored in the HalluRAG dataset and eventually used to train a multilayer perceptron. For more details see text.
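
The final step of this pipeline, a binary classifier that maps an internal-state vector to a hallucination probability, can be sketched as follows. This is a toy illustration in pure Python under loud assumptions: the 4-dimensional vectors are synthetic stand-ins for real CEVs or IAVs, and the layer sizes, learning rate, and epoch count are arbitrary choices, not the paper's configuration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One-hidden-layer MLP: tanh hidden units, sigmoid output."""

    def __init__(self, n_in, n_hidden, rng):
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return hidden activations and the predicted probability."""
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.W1, self.b1)]
        p = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, p

    def sgd_step(self, x, y, lr):
        """One stochastic gradient step on binary cross-entropy."""
        h, p = self.forward(x)
        dz2 = p - y  # gradient w.r.t. the output pre-activation
        for j, hj in enumerate(h):
            # Backpropagate through tanh using w2[j] before it is updated.
            dz1 = dz2 * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * dz2 * hj
            for i, xi in enumerate(x):
                self.W1[j][i] -= lr * dz1 * xi
            self.b1[j] -= lr * dz1
        self.b2 -= lr * dz2


def mean_bce(model, data):
    """Mean binary cross-entropy of the model over (vector, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        _, p = model.forward(x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def make_toy_states(rng, n_per_class=20, dim=4):
    """Synthetic stand-ins for internal states; label 1 = hallucinated."""
    data = []
    for label in (0, 1):
        center = [1.0 if i % 2 == label else 0.0 for i in range(dim)]
        for _ in range(n_per_class):
            data.append(([c + rng.gauss(0.0, 0.1) for c in center], label))
    return data


rng = random.Random(0)
data = make_toy_states(rng)
model = TinyMLP(n_in=4, n_hidden=3, rng=rng)
loss_before = mean_bce(model, data)
for _ in range(100):
    rng.shuffle(data)
    for x, y in data:
        model.sgd_step(x, y, lr=0.1)
loss_after = mean_bce(model, data)
print(f"mean BCE before: {loss_before:.3f}, after: {loss_after:.3f}")
```

In a real setup the input vectors would be extracted from the LLM per generated sentence, and one such classifier could be trained for answerable prompts and another for unanswerable ones.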