Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

Yiyou Sun, Yu Gai, Lijie Chen, Abhilasha Ravichander, Yejin Choi, Dawn Song

TL;DR

This work addresses the pervasive issue of hallucinations in large language models by introducing a subsequence association framework that traces outputs to triggering input subsequences. It shows decoder-only transformers encode subsequence embeddings and map these via linear components to next-token logits, enabling a cross-context causal tracing approach. The authors develop a reproducibility-focused tracing algorithm (SAT) that identifies dominant triggering subsequences by sampling diversified inputs and using beam search, and demonstrate that SAT outperforms standard attribution methods while correlating with training-corpus statistics (Dolma/Dolma-1.7). The results establish a unified lens on hallucinations, offer a practical tracing tool, and reveal that both small and large models rely on specific subsequences that reflect their training data, with potential implications for mitigation and debugging.

Abstract

Large language models (LLMs) frequently generate hallucinations (content that deviates from factual accuracy or the provided context), which are hard to diagnose because of the complex interplay of underlying causes. This paper introduces a subsequence association framework to systematically trace and understand hallucinations. Our key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. Through theoretical and empirical analyses, we demonstrate that decoder-only transformers effectively function as subsequence embedding models, with linear layers encoding input-output associations. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts. Experiments show that our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
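The idea of analyzing hallucination probabilities across randomized input contexts can be illustrated with a minimal sketch. It is not the authors' SAT implementation: it covers only the scoring of one candidate subsequence, omits the beam search over candidates, and uses uniform random token replacement as an assumed randomization strategy. The names `is_subsequence`, `randomize_context`, `reproducibility_rate`, and `toy_model` are hypothetical, and `toy_model` stands in for a real LLM call.

```python
import random
from typing import Callable, List, Sequence, Set


def is_subsequence(sub: Sequence[str], seq: Sequence[str]) -> bool:
    """True if `sub` appears in `seq` in order (not necessarily contiguously)."""
    it = iter(seq)
    # `tok in it` advances the iterator, so the order of tokens is enforced.
    return all(tok in it for tok in sub)


def randomize_context(tokens: Sequence[str], keep: Set[int],
                      vocab: Sequence[str], rng: random.Random) -> List[str]:
    """Keep tokens at positions in `keep`; replace all others at random.

    Assumption: uniform replacement from `vocab` is a stand-in for whatever
    input-diversification strategy is actually used.
    """
    return [tok if i in keep else rng.choice(vocab)
            for i, tok in enumerate(tokens)]


def reproducibility_rate(model: Callable[[List[str]], List[str]],
                         tokens: Sequence[str], keep: Set[int],
                         target: Sequence[str], vocab: Sequence[str],
                         n_samples: int = 200, seed: int = 0) -> float:
    """Fraction of randomized inputs whose output still contains `target`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        output = model(randomize_context(tokens, keep, vocab, rng))
        hits += is_subsequence(target, output)
    return hits / n_samples


def toy_model(inp: List[str]) -> List[str]:
    """Hypothetical model: emits 'Elvis' whenever 'New' ... 'York' is present."""
    return ["Elvis"] if is_subsequence(["New", "York"], inp) else ["other"]


prompt = ["a", "singer", "born", "in", "New", "York"]
filler_vocab = ["x", "y", "z"]  # filler tokens; deliberately excludes the trigger

rate_with_trigger = reproducibility_rate(
    toy_model, prompt, {4, 5}, ["Elvis"], filler_vocab)
rate_without_trigger = reproducibility_rate(
    toy_model, prompt, set(), ["Elvis"], filler_vocab)
```

In this toy setting, a candidate subsequence whose rate stays high while the rest of the context is randomized would be a plausible trigger for the target output. A fuller tracing procedure would search over many candidate position sets rather than score a single one.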

Paper Structure

This paper contains 37 sections, 11 theorems, 123 equations, 9 figures, 4 tables, 2 algorithms.

Key Result

Proposition 2.3

(Simplified form) Under the independence assumption for subsequences appearing in $\mathbf{s}'$, we have:

Figures (9)

  • Figure 1: Hallucination examples and evidence of subsequence association in Olmo-7B [olmo2024]. (a) An example of a hallucination. The red blocks highlight hallucinated outputs naming singers who were not born in New York. (b) Additional examples illustrating subsequence associations (using $\Psi$ as a measure). The probability that each singer appears changes across new sentences that contain a particular subsequence. These subsequence associations can be traced back to the training corpus, Dolma [dolma].
  • Figure 2: Two hallucination cases from GPT-4o, shown with simplified input/output. Full screenshots are included in Appendix \ref{sec:sup_example}. The right side of each example displays the results of two generations. The faithful answers are Elvis Crespo (left) and his niece (right), respectively.
  • Figure 3: Illustration of subsequence embedding procedure via transformer blocks and the encoding of associations within matrix parameters. Orange boxes represent embeddings for individual tokens and subsequences. Blue boxes denote transformer blocks, each comprising Multi-Head Attention (MHA) and Feed-Forward Networks (FFN). The weight matrix is highlighted in yellow boxes. Grey lines depict the flow of token information used to form the subsequence embedding for (sin, ger, New, York, El).
  • Figure 4: Reproducibility rate of the target hallucination subsequence in the output across four input distributions, evaluated at three different subsequence length ratios.
  • Figure 5: Examples of input prompts to GPT-4o-0806 along with the trigger subsequences identified by our tracing algorithm (SAT), and the corresponding reproducible hallucinated outputs in the settings of $\text{bert}, \text{rand}, \text{gpt-m},$ and $\text{gpt-t}$, respectively. (Results are obtained via API calls; web interface may invoke web search.)
  • ...and 4 more figures

Theorems & Definitions (21)

  • Definition 2.1: Subsequence Relation ($\sqsubseteq$)
  • Definition 2.2: Subsequence Association
  • Proposition 2.3
  • Theorem 2.4: Subsequence Embedding with Transformer
  • Proposition 2.5
  • Proposition 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • ...and 11 more