H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs

Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, Maosong Sun

TL;DR

The paper reveals a microscopic basis for LLM hallucinations by identifying an ultra-sparse set of Hallucination-Associated Neurons (H-Neurons) in FFN layers that reliably predict hallucinations and generalize across tasks. Through activation perturbation, it shows that these neurons causally drive over-compliance behaviors across multiple safety-related benchmarks. By tracing their origins via backward transferability and parameter drift, the work provides evidence that H-Neurons arise in pre-training and persist through alignment. These findings offer a concrete neuron-level target for detecting, and potentially mitigating, hallucinations in LLMs.

Abstract

Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.

Paper Structure

This paper contains 16 sections, 10 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Framework for identifying H-Neurons. (a) Within the Feed-Forward Network, we calculate the contribution of each neuron in one forward pass. This metric normalizes the magnitude of an individual neuron's projected output $|n(x)|$ against the layer's total output vector $|y|$, providing a standardized measure of its contribution to the hidden state. (b) The process begins by generating a balanced dataset of faithful (green check) and hallucinatory (red cross) responses using the TriviaQA benchmark. We extract the contribution profiles of neurons specifically on the answer tokens to train a linear classifier. Neurons assigned positive weights by this classifier are identified as "H-Neurons", distinguishing them from normal neurons based on their predictive role in generating hallucinations.
  • Figure 2: Illustration of the behavioral impact of intervening on the H-Neurons. The right panel presents examples across four dimensions: Invalid Premises (hallucinating details about non-existent "cat feathers"), Compliance with Misleading Context (adopting counterfactual claims about Marie Curie), Skeptical Attitudes (abandoning a correct answer when challenged), and Harmful Instructions (bypassing safety filters to assist with weapon creation).
  • Figure 3: Compliance rate (%) of perturbed LLMs. Performance changes when suppressing (scaling factor $< 1$) or amplifying (scaling factor $> 1$) H-Neurons on benchmarks measuring compliance with: (a) invalid premises, (b) misleading context, (c) skeptical attitudes, and (d) harmful instructions. Here, the compliance rate is measured as the rate of accepting invalid premises on FalseQA, the accuracy on FaithEval, the rate of agreeing with incorrect feedback on Sycophancy, and the rate of producing harmful responses on Jailbreak. Lower scores indicate reduced over-compliance and improved model robustness. As the scaling factor increases, compliance rates generally rise across all four dimensions, demonstrating that H-Neurons causally control over-compliance behavior.
  • Figure 4: (a) AUROC scores of classifiers trained on instruction-tuned models and applied directly to their corresponding base models. All models significantly outperform the random baseline. This robust transferability indicates that the neural signature of hallucination is already present after the pre-training stage. (b) Distribution of H-Neuron similarity ranks. Each subplot shows the normalized rank positions (0–1 scale) of H-Neurons, with smaller normalized rank values corresponding to larger parameter changes from pre-training to alignment. Black dashed lines indicate the average rank, and colored circles represent H-Neurons. Statistical significance of the higher cosine similarity of H-Neurons compared to other neurons is verified via a one-sided t-test. Across most models, H-Neurons concentrate in the higher-normalized-rank region, suggesting that these neurons are largely inherited from pre-training and are not introduced or heavily modified by SFT.
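
The contribution metric in Figure 1(a) and the scaling intervention in Figure 3 can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a neuron's projected output is its activation times its row of the FFN down-projection matrix, that magnitudes are L2 norms, and that the intervention multiplies the selected activations by a scaling factor. The function names (`contributions`, `scale_neurons`) and the toy numbers are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def neuron_outputs(activations, down_proj):
    """Per-neuron projected outputs n_i(x) = a_i * w_i (assumed form).

    activations: list of m floats (hidden activations of the FFN)
    down_proj:   list of m rows, each a list of d floats
    """
    return [[a * w for w in row] for a, row in zip(activations, down_proj)]


def contributions(activations, down_proj):
    """Ratio |n_i(x)| / |y| for every neuron, where y = sum_i n_i(x)."""
    per_neuron = neuron_outputs(activations, down_proj)
    d = len(down_proj[0])
    y = [sum(n[j] for n in per_neuron) for j in range(d)]
    y_norm = l2_norm(y)
    if y_norm == 0.0:
        # Degenerate case: the layer output is zero, so no ratio is defined.
        return [0.0 for _ in per_neuron]
    return [l2_norm(n) / y_norm for n in per_neuron]


def scale_neurons(activations, neuron_ids, alpha):
    """Multiply the activations of the selected neurons by alpha.

    alpha < 1 suppresses the neurons, alpha > 1 amplifies them.
    """
    ids = set(neuron_ids)
    return [a * alpha if i in ids else a for i, a in enumerate(activations)]


# Toy FFN with 3 hidden neurons and a 2-dimensional output.
acts = [1.0, 2.0, 0.0]
w_down = [[3.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

base = contributions(acts, w_down)
suppressed = contributions(scale_neurons(acts, [1], 0.0), w_down)
```

In this toy case the layer output is `[3.0, 4.0]` with norm 5, so the three contributions are 0.6, 0.8, and 0.0. Because each value is a ratio of norms rather than a share of a total, contributions need not sum to 1. Setting the scaling factor of neuron 1 to 0 removes its output entirely, and neuron 0 then accounts for the whole layer output.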