Enhancing Hallucination Detection through Noise Injection

Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, Roland Memisevic

TL;DR

This work tackles the problem of hallucinations in large language models by tying detection to model uncertainty and proposing a novel source of randomness: perturbing intermediate layer activations. The Noise Enhanced Hallucination Detector (NED) combines intermediate-layer noise with temperature-based prediction-layer sampling to marginalize over two distinct randomness sources, producing stronger uncertainty signals as quantified by metrics such as $E_{answer}$ and related entropy measures. Empirically, the method improves AUROC across GSM8K, TriviaQA, CSQA, and PrOntoQA on models like Llama2-13B-chat and Mistral-7B, with robust ablations showing benefits across numbers of generations, injection layers, and architectures. The approach offers a practical, model-agnostic tool for safer LLM deployment by more reliably distinguishing true from hallucinated outputs in diverse reasoning and QA tasks.

Abstract

Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from a set of samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple and efficient approach that perturbs an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate its effectiveness across a wide range of datasets and model architectures.
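The sampling scheme described above can be illustrated with a minimal sketch. The toy hidden vector, the projection weights, the choice of uniform noise, and the `noise_scale` parameter are all assumptions made for illustration and are not taken from the paper. Only the overall idea follows the abstract: perturb intermediate activations, sample from the prediction layer with a temperature, and measure dispersion over the resulting answers.

```python
import math
import random
from collections import Counter


def answer_entropy(answers):
    """Empirical entropy (in nats) of the distribution over distinct answers."""
    total = len(answers)
    counts = Counter(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max logit for stability."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_answer(hidden, weights, rng, noise_scale, temperature):
    """Draw one answer index from a toy model (hypothetical, not the paper's LLM)."""
    # Randomness source 1: perturb the intermediate activations.
    # Uniform noise is an assumption made for this sketch.
    noisy = [h + rng.uniform(-noise_scale, noise_scale) for h in hidden]
    # Toy prediction layer: one logit per candidate answer.
    logits = [sum(w * h for w, h in zip(row, noisy)) for row in weights]
    # Randomness source 2: temperature sampling at the prediction layer.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def uncertainty(hidden, weights, num_generations=5, noise_scale=0.0,
                temperature=1.0, seed=0):
    """Answer entropy over several generations for a single input."""
    rng = random.Random(seed)
    answers = [
        sample_answer(hidden, weights, rng, noise_scale, temperature)
        for _ in range(num_generations)
    ]
    return answer_entropy(answers)
```

Calling `uncertainty` once with `noise_scale=0.0` and once with a positive `noise_scale` on the same input shows how the second randomness source changes the dispersion of answers. The noise distribution, magnitude, and injection layers actually used in the paper are not specified in this section.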

Paper Structure

This paper contains 27 sections, 7 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Source of Randomness in Hallucination Detection. Prior work uses prediction layer sampling and measures model uncertainty across samples for hallucination detection. Additionally, we explore noise injection that randomly perturbs intermediate representations, introducing a second source of randomness at earlier stages.
  • Figure 2: Effect of Intermediate Layer Randomness on Hallucination Detection. (a) Standalone Effect. With noise injected to randomly perturb intermediate representations, the LLM exhibits greater uncertainty when hallucinating (grey) than when not hallucinating (blue). (b) Combined Effect. Injecting noise improves the separation between hallucinations and non-hallucinations, enhancing hallucination detection. (b) Left: prediction layer sampling alone; (b) Right: noise injection and prediction layer sampling. Model uncertainty is measured by Equation \ref{eq:answer_entropy}; a higher value indicates higher uncertainty. Evaluation performed on the GSM8K dataset with the Llama2-13B-chat model across 5 generations.
  • Figure 3: Complementary Effect of Different Randomness Sources. The x-axis presents model uncertainty with prediction layer sampling whereas the y-axis presents model uncertainty under intermediate layer noise injection. A Pearson correlation of 0.67 indicates a complementary relationship between the two sources.
  • Figure 4: Noise Injection Enhances Hallucination Detection without Degrading Model Accuracy Across Different Numbers of Generations. Evaluation with the GSM8K dataset on the Llama2-13B-chat model across 1 to 20 generations. Hallucination detection AUROC (a) and model accuracy (b) are reported; higher values are better. The mean and standard deviation across random seeds are shown in the plot.
  • Figure 5: Intermediate Layer Randomness Enhances Hallucination Detection. Evaluation performed on the GSM8K dataset with the Llama2-13B-chat model across 10 generations. The rest of the setup follows Figure \ref{fig:scheme-demo} (b).
  • ...and 2 more figures
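
The captions above report hallucination detection AUROC and a Pearson correlation between the two uncertainty measures. As a rough sketch of how such numbers could be computed from per-question uncertainty scores, assuming label 1 marks a hallucinated answer and label 0 a correct one (the helper names `auroc` and `pearson` are hypothetical, and the paper's evaluation code is not shown here):

```python
import math


def auroc(scores, labels):
    """Probability that a hallucinated sample gets a higher uncertainty score
    than a correct one; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

An AUROC of 0.5 corresponds to uncertainty scores that do not separate hallucinated from correct answers at all, while 1.0 means every hallucinated answer received a higher score than every correct one.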