From Noise to Narrative: Tracing the Origins of Hallucinations in Transformers

Praneet Suresh, Jack Stanley, Sonia Joseph, Luca Scimeca, Danilo Bzdok

TL;DR

This paper investigates transformer hallucinations by tracing a mechanistic link between input uncertainty, internal semantic concepts, and output faithfulness. It uses sparse autoencoders (SAEs) to decompose residual-stream activations into a sparse, human-interpretable concept space, then shows that input noise and perturbations activate a larger number of concepts, most strongly in the middle layers. Partial least squares (PLS) regression relates concept activations to hallucination scores, and suppressing the concepts that the PLS model ranks as most influential reduces hallucination scores in the reported experiments. Across vision and language transformers, the findings reveal a layer-wise pattern of concept recruitment and offer a practical, dataset-agnostic framework for monitoring and mitigating hallucinations, with broad implications for AI safety and alignment.
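To make the idea of a sparse concept space concrete, here is a minimal sketch of how an SAE encoder might turn residual-stream activations into concept activations, and how the number of active concepts per input (the L0 count reported in the figures) could be measured. The ReLU encoder form, the toy dimensions, the random weights, and the helper names `sae_encode` and `mean_l0` are all assumptions made for illustration; the paper's actual SAE architecture and training setup may differ.

```python
import numpy as np

def sae_encode(resid, W_enc, b_enc):
    """Hypothetical ReLU encoder: residual activations -> non-negative concept activations."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def mean_l0(concepts):
    """Average number of concepts with non-zero activation per input row."""
    return float((concepts > 0).sum(axis=1).mean())

# Toy example with random (untrained) weights, for shape and bookkeeping only.
rng = np.random.default_rng(0)
d_model, n_concepts = 8, 32
W_enc = rng.normal(size=(d_model, n_concepts))
b_enc = -2.0 * np.ones(n_concepts)  # a negative bias pushes most concepts to zero
resid = rng.normal(size=(5, d_model))  # 5 stand-in residual-stream vectors

concepts = sae_encode(resid, W_enc, b_enc)
l0 = mean_l0(concepts)
```

Comparing `mean_l0` between natural and perturbed inputs at a given layer is one simple way to express the "change in L0 from baseline" that the summary refers to.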

Abstract

As generative AI systems become competent and democratized in science, business, and government, deeper insight into their failure modes has become an acute need. The occasional volatility in their behavior, such as the propensity of transformer models to hallucinate, impedes trust and adoption of emerging AI solutions in high-stakes areas. In the present work, we establish how and when hallucinations arise in pre-trained transformer models through concept representations captured by sparse autoencoders, under scenarios with experimentally controlled uncertainty in the input space. Our systematic experiments reveal that the number of semantic concepts used by the transformer model grows as the input information becomes increasingly unstructured. In the face of growing uncertainty in the input space, the transformer model becomes prone to activate coherent yet input-insensitive semantic features, leading to hallucinated output. At its extreme, for pure-noise inputs, we identify a wide variety of robustly triggered and meaningful concepts in the intermediate activations of pre-trained transformer models, whose functional integrity we confirm through targeted steering. We also show that hallucinations in the output of a transformer model can be reliably predicted from the concept patterns embedded in transformer layer activations. This collection of insights on transformer internal processing mechanics has immediate consequences for aligning AI models with human values, for AI safety, for opening the attack surface to potential adversarial attacks, and for providing a basis for automatic quantification of a model's hallucination risk.
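One way to picture "experimentally controlled uncertainty" for text is n-gram shuffling: the token sequence is cut into consecutive chunks of n tokens and the chunk order is permuted, so smaller n destroys more structure. The sketch below is an illustrative assumption about how such a perturbation could be implemented; the function name `shuffle_ngrams`, the chunking scheme, and the seed handling are hypothetical and not taken from the paper.

```python
import random

def shuffle_ngrams(tokens, n, seed=0):
    """Split tokens into consecutive n-token chunks and shuffle the chunk order.

    Word order inside each chunk is preserved; only the order of chunks changes.
    """
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng = random.Random(seed)  # local generator keeps the shuffle reproducible
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]

sentence = "the quick brown fox jumps over the lazy dog today".split()
mild = shuffle_ngrams(sentence, n=5)    # large chunks: mostly intact phrases
severe = shuffle_ngrams(sentence, n=1)  # single tokens: word order fully scrambled
```

The same pattern would apply to images by treating patches as the chunks, with smaller patches corresponding to stronger perturbation.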

Paper Structure

This paper contains 19 sections, 4 equations, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Workflow to examine hallucination risk in each layer of the transformer. To assess the extent to which transformer models infer meaningful concepts from semantically void inputs, we study the semanticity of concepts in SAEs trained on residual stream activations of 1.3 million Gaussian noise samples from a pre-trained CLIP vision transformer at each layer by probing the concepts with natural images from the ImageNet-1k validation set. Many concepts are highly interpretable and consistent, despite the SAEs only ever being exposed to pure noise residual stream activations during training. While this noise training setup is specific to our first experiment, subsequent experiments adopt the same concept evaluation approach with alternative transformer models and input modalities. See Preliminaries and Appendix A for further experimental details.
  • Figure 2: Semantic concepts are reliably invoked by vision transformer layer 9 activations from pure noise. Each panel shows the top 4 images among 50,000 candidate images from the ImageNet-1k validation set that most activated a particular semantic concept. These semantic concepts are defined by an SAE trained on pure noise activations from the residual stream at layer 9 of a CLIP vision transformer. Above each image we report the corresponding semantic label for that image. Patch colors indicate the individual patch activation strengths within that image for a given semantic concept (yellow = more activation of the concept). See Appendix B for additional concept examples.
  • Figure 3: Transformers fed with noise inputs lead to neuron activations with detectable and controllable semantic structure in many layers. We measure the interpretability of a concept by the semantic similarity of the labels of the top 16 images that maximally activate that concept. As a more stringent test, we also measure a concept’s steerability by the ability of that concept to causally induce its own class label when added to the residual stream of neutral input images. We find that a very large portion of our noise-derived semantic concepts are highly interpretable, and a non-negligible number of these concepts are even steerable, particularly in the early and middle layers. We report the percentage of unique concepts across our noise-trained SAE meeting the aforementioned interpretability and steerability thresholds.
  • Figure 4: Increasing uncertainty in vision or text inputs elicits more semantic structure in mid-layers of transformers. The average number of SAE concepts identified (L0) increases dramatically with increasing input perturbation. We report the average change in L0 from baseline, corresponding to the number of SAE concepts with non-zero activations, across (a) patch-shuffled image activations and (b) n-gram-shuffled text activations for each transformer layer. Error bars show one standard deviation. Smaller patches and lower n-gram count induce greater input uncertainty for images and text, respectively. For both modalities, the L0 difference between natural inputs and perturbed inputs peaks in the middle layers, and increases with increasing levels of deliberate scrambling of transformer input information.
  • Figure 5: Transformer layer activations can be used to directly predict risk of hallucinated model output. (a) Hallucination score prediction for the task of faithful summarization of 1,006 source articles. We compare Gemma 2B-IT-generated summaries against the ground truth source articles. We use the sparse SAE concept activations, derived from Gemma 2B-IT residual stream activations, as input to a PLS regression model, predicting the hallucination score for each example. We report 10-fold cross-validated coefficient of determination ($R^2$) on unseen examples, with error bars showing one standard deviation. (b) Suppressing the top 10 SAE concepts in Layer 11's residual stream, identified by the PLS model to be the primary drivers of hallucination, significantly reduces mean hallucination scores across the top quartile most hallucinated examples ($n=252$). We show a histogram of hallucination scores before (grey) and after (blue) suppression: many examples report significant reductions in hallucination, with a mean score drop of 0.19 in this subset (dashed lines).
  • ...and 11 more figures
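The Figure 5 caption describes predicting a hallucination score from sparse concept activations with PLS regression and reporting 10-fold cross-validated $R^2$. The following is a minimal sketch of that kind of pipeline on synthetic data, using a small NumPy implementation of single-response PLS (PLS1). The data, the number of components, the fold assignment, and the helper names `pls1_fit` and `cv_r2` are assumptions for illustration only; this is not the authors' code and does not reproduce their results.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS (PLS1); returns (x_mean, y_mean, coef)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                 # direction of maximal covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                    # latent scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        qk = (yk @ t) / tt            # y loading
        Xk = Xk - np.outer(t, p)      # deflate X and y before the next component
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def cv_r2(X, y, n_components=2, n_folds=10, seed=0):
    """Mean and standard deviation of held-out R^2 across folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for test in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, test)
        x_mean, y_mean, coef = pls1_fit(X[train], y[train], n_components)
        pred = (X[test] - x_mean) @ coef + y_mean
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in: sparse non-negative "concept activations" and a noisy linear score.
rng = np.random.default_rng(0)
X = rng.exponential(size=(300, 20)) * (rng.random((300, 20)) < 0.3)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=300)
mean_r2, std_r2 = cv_r2(X, y)
```

In a fitted model of this kind, the largest-magnitude entries of `coef` indicate which concepts most influence the predicted score, which is the sort of ranking one could use to choose concepts for suppression.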