Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs

Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov

TL;DR

This work addresses hallucinations in large language models from a white-box perspective and introduces WACK, a knowledge-based framework for constructing per-model benchmarks in open-book and closed-book QA. WACK automatically labels type-3 hallucinations (where the model knows the answer) and enables targeted activation-level interventions in which steering vectors $d_{l,c}$ are added to activations $v_{l,c}$. The analysis focuses on how the intervened component (MLP, Attention, Heads, Residual) and the timing (pre- vs. post-answer) affect mitigation. The study systematically compares intervention strategies and introduces dynamic, pre-answer interventions. It shows that attention components generally yield the best results, while intervening on the residual stream can be detrimental unless a dynamic approach is used. It also shows that pre-hallucination vectors outperform post-hallucination vectors for steering, and that finetuning (Goat) improves mitigation effectiveness. The findings offer practical guidelines for robust hallucination mitigation and underscore the need for multi-metric evaluation (classification, generation, and perplexity) when assessing interventions. Limitations include the restriction to two models and two datasets. Overall, WACK provides a principled, benchmark-driven path to understanding and reducing LLM hallucinations through carefully designed, dynamic inner-state interventions.

Abstract

Large language models (LLMs) are prone to hallucinations, which has sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's generation, typically computing representative vectors of hallucinations vs. grounded generations, for steering the model's hidden states away from a hallucinatory state. However, common studies employ different setups and do not properly separate different possible causes of hallucinations, making interventions misguided. In this work, we introduce a method for categorizing examples based on the model's prior knowledge, named WACK. We construct WACK benchmarks that support interventions in two settings: open-book and closed-book question answering. Using the benchmarks, we perform an extensive investigation of the effect of different choices for intervention, such as the intervened components, and how often and how strongly to intervene. We find that intervention success varies depending on the component, with the attention blocks performing well and the residual stream proving detrimental to language modeling capabilities. We also show that interventions can benefit from representative vectors collected before, rather than after, a hallucination occurs. Finally, we introduce a new dynamic intervention, which intervenes only if needed, and thus is more robust than standard static interventions. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation.
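
To make the idea of steering with representative vectors concrete, the following is a minimal sketch and not the authors' implementation. It assumes the steering vector is the difference between the mean grounded activation and the mean hallucinated activation, that the intervention adds a scaled copy of this vector to an activation, and that a dynamic intervention applies the shift only when a detector flags the activation. The function names, the scale `alpha`, and the dot-product detector are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(grounded: Sequence[Vector],
                    hallucinated: Sequence[Vector]) -> Vector:
    """Assumed form of the steering vector: mean(grounded) - mean(hallucinated)."""
    mu_grounded = mean_vector(grounded)
    mu_hallucinated = mean_vector(hallucinated)
    return [g - h for g, h in zip(mu_grounded, mu_hallucinated)]


def static_intervention(v: Vector, d: Vector, alpha: float = 1.0) -> Vector:
    """Always shift the activation: v + alpha * d."""
    return [vi + alpha * di for vi, di in zip(v, d)]


def dynamic_intervention(v: Vector, d: Vector,
                         needs_intervention: Callable[[Vector], bool],
                         alpha: float = 1.0) -> Vector:
    """Shift the activation only when the (hypothetical) detector flags it."""
    if needs_intervention(v):
        return static_intervention(v, d, alpha)
    return list(v)


# Toy two-dimensional activations, made up for illustration only.
grounded = [[1.0, 0.0], [3.0, 2.0]]
hallucinated = [[0.0, 1.0], [0.0, 3.0]]
d = steering_vector(grounded, hallucinated)


def looks_hallucinatory(v: Vector) -> bool:
    """Hypothetical detector: flag activations pointing against d."""
    return sum(vi * di for vi, di in zip(v, d)) < 0.0


flagged = dynamic_intervention([0.0, 2.0], d, looks_hallucinatory)
untouched = dynamic_intervention([2.0, 1.0], d, looks_hallucinatory)
print(d, flagged, untouched)
```

In this sketch the static variant shifts every activation, whereas the dynamic variant leaves unflagged activations as they are, which is one plausible reading of "intervenes only if needed".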

Paper Structure (32 sections, 21 figures, 8 tables)

Figures (21)

  • Figure 1: Outline of WACK method for dataset construction.
  • Figure 2: Example from DisentQA (open-book) with both contextual and parametric answers.
  • Figure 3: Example from TriviaQA (closed-book), with good-shot or bad-shot examples added at the beginning of the prompt; in the bad-shot permutation, a wrong answer is used instead of the correct answer.
  • Figure 4: Hallucination labeling in the closed-book setting. Model generations are in bold.
  • Figure 5: Methodological variability in intervention, with a list of related papers and the variables each uses. Dynamic is a new methodology introduced in this work.
  • ...and 16 more figures