Enhancing Hallucination Detection through Noise Injection
Litian Liu, Reza Pourreza, Sunny Panchal, Apratim Bhattacharyya, Yao Qin, Roland Memisevic
TL;DR
This work addresses hallucination detection in large language models by tying detection to model uncertainty and introducing an additional source of randomness: perturbing intermediate-layer activations. The Noise Enhanced Hallucination Detector (NED) combines this intermediate-layer noise with temperature-based sampling at the prediction layer, so that uncertainty is estimated over two distinct sources of randomness and quantified by metrics such as $E_{answer}$ and related entropy measures. Empirically, the method improves detection AUROC on GSM8K, TriviaQA, CSQA, and PrOntoQA with models such as Llama2-13B-chat and Mistral-7B, and ablations show that the benefit holds across the number of generations, the choice of injection layers, and model architectures. The approach offers a practical tool for safer LLM deployment by more reliably distinguishing correct from hallucinated outputs in diverse reasoning and QA tasks.
Abstract
Large Language Models (LLMs) are prone to generating plausible yet incorrect responses, known as hallucinations. Effectively detecting hallucinations is therefore crucial for the safe deployment of LLMs. Recent research has linked hallucinations to model uncertainty, suggesting that hallucinations can be detected by measuring dispersion over answer distributions obtained from a set of samples drawn from a model. While drawing from the distribution over tokens defined by the model is a natural way to obtain samples, in this work, we argue that it is sub-optimal for the purpose of detecting hallucinations. We show that detection can be improved significantly by taking into account model uncertainty in the Bayesian sense. To this end, we propose a very simple and efficient approach that perturbs an appropriate subset of model parameters, or equivalently hidden unit activations, during sampling. We demonstrate its effectiveness across a wide range of datasets and model architectures.
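To make the idea concrete, the following is a minimal sketch of uncertainty estimation with two randomness sources, not the authors' implementation. It replaces the LLM with a toy linear "prediction layer" over a fixed hidden vector; the function names, the uniform noise distribution, and the parameters `noise_scale`, `temperature`, and `k` are all illustrative assumptions. Dispersion over the sampled answers is measured here with the empirical entropy of the answer distribution, one common choice of dispersion measure.

```python
import math
import random
from collections import Counter


def answer_entropy(answers):
    """Empirical entropy (in nats) of the distribution over distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def sample_answer(hidden, weights, temperature, noise_scale, rng):
    """Draw one answer index from a toy stand-in for a model."""
    # Randomness source 1 (assumed form): uniform noise added to the
    # intermediate activations before they reach the prediction layer.
    noisy = [h + rng.uniform(0.0, noise_scale) for h in hidden]
    # Toy prediction layer: one logit per candidate answer.
    logits = [sum(w * h for w, h in zip(row, noisy)) for row in weights]
    # Randomness source 2: temperature-scaled softmax sampling.
    scaled = [logit / temperature for logit in logits]
    top = max(scaled)
    probs = [math.exp(s - top) for s in scaled]  # unnormalized is fine for choices()
    return rng.choices(range(len(weights)), weights=probs, k=1)[0]


def uncertainty_score(hidden, weights, k=10, temperature=1.0,
                      noise_scale=0.0, seed=0):
    """Sample k answers and return the entropy of the resulting answer set."""
    rng = random.Random(seed)
    answers = [
        sample_answer(hidden, weights, temperature, noise_scale, rng)
        for _ in range(k)
    ]
    return answer_entropy(answers)


if __name__ == "__main__":
    hidden = [0.5, -0.2, 0.1]                       # hypothetical activations
    weights = [[1.0, 0.0, 0.5], [0.2, 1.0, -0.3]]   # two candidate answers
    print(uncertainty_score(hidden, weights, noise_scale=0.0))
    print(uncertainty_score(hidden, weights, noise_scale=0.5))
```

Setting `noise_scale=0.0` recovers plain temperature sampling, so the two scores can be compared on the same input. A higher score means the sampled answers disagree more, which under the uncertainty view of hallucination would flag the response as less reliable. In a real model the perturbation would be applied inside the network during generation rather than to a fixed vector.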
