Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers

Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei

TL;DR

The paper identifies out-of-context reasoning (OCR) as a single mechanism behind both generalization and hallucination when new facts are injected into LLMs. It formalizes OCR as a symbolic factual-recall task and shows, for a one-layer attention-only transformer with factorized output and value matrices, that OCR arises from the implicit bias of gradient descent toward minimizing the nuclear norm of the combined output-value matrix, which enables cross-concept inferences. OCR yields strong generalization when the associated relations are causal and induces hallucination when they are not; in both cases learning is sample-efficient, and the effect is validated on real-world data. The work provides a theoretical foundation for OCR, explains why pretraining knowledge can both help and mislead during fine-tuning, and points to future work on mitigating OCR-related errors in deeper, multi-layer transformers while preserving desirable generalization. Overall, OCR offers a new lens for analyzing and controlling undesired behaviors during knowledge injection in large language models.

Abstract

Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
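
The link between factorized weights and the nuclear norm can be illustrated with a standard linear-algebra identity: over all exact factorizations $W = W_O W_V^\top$, the smallest value of $\tfrac{1}{2}(\|W_O\|_F^2 + \|W_V\|_F^2)$ equals the nuclear norm of $W$. The sketch below is a numerical check of that identity, not the paper's proof. The sizes `d` and `d_h`, the product convention `W_O @ W_V.T`, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 6, 8  # hypothetical sizes with d_h >= d

# A rank-3 stand-in for a combined output-value matrix.
W = rng.standard_normal((d, 3)) @ rng.standard_normal((3, d))


def nuclear_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()


def balanced_factors(M, d_h):
    """Split M = U diag(s) V^T evenly: W_O = U sqrt(s), W_V = V sqrt(s), zero-padded to d_h columns."""
    U, s, Vt = np.linalg.svd(M)
    root = np.sqrt(s)
    W_O = np.zeros((M.shape[0], d_h))
    W_V = np.zeros((M.shape[1], d_h))
    W_O[:, : len(s)] = U * root
    W_V[:, : len(s)] = Vt.T * root
    return W_O, W_V


def factor_cost(W_O, W_V):
    """Half the sum of squared Frobenius norms of the two factors."""
    return 0.5 * (np.linalg.norm(W_O) ** 2 + np.linalg.norm(W_V) ** 2)


W_O, W_V = balanced_factors(W, d_h)
balanced_cost = factor_cost(W_O, W_V)

# Rescale the columns: (W_O G)(W_V G^{-1})^T is still W, but the factors are unbalanced.
g = rng.uniform(0.5, 2.0, size=d_h)
other_cost = factor_cost(W_O * g, W_V / g)

print(balanced_cost, nuclear_norm(W), other_cost)
```

The balanced factorization attains the nuclear norm, while the rescaled one represents the same matrix at a cost that is no smaller. A norm penalty on the two factors therefore acts like a nuclear-norm penalty on their product, a penalty the non-factorized parameterization does not see.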

Paper Structure

This paper contains 52 sections, 23 theorems, 173 equations, 7 figures, 3 tables.

Key Result

Proposition 1

Suppose $d_h \ge d$. The factorized parameterization ${\boldsymbol{\theta}} = (\bm{W}_\mathsf{O}, \bm{W}_\mathsf{V})$ with $\bm{W}_\mathsf{O},\bm{W}_\mathsf{V} \in \mathbb{R}^{d \times d_h}$ has equivalent expressive power to the non-factorized parameterization $\tilde{{\boldsymbol{\theta}}} = \bm{W
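
One direction of this equivalence is easy to check numerically. The sketch below assumes the combined matrix is the product `W_O @ W_V.T` of shape $d \times d$ (the statement above is truncated, so this convention is an assumption) and uses a hypothetical helper, `pad_factorize`, that is not taken from the paper. It rebuilds an arbitrary $d \times d$ matrix from two $d \times d_h$ factors by zero-padding, which requires $d_h \ge d$.

```python
import numpy as np


def pad_factorize(W_OV, d_h):
    """Return (W_O, W_V), each d x d_h, with W_O @ W_V.T == W_OV.

    Hypothetical construction: W_O = [W_OV, 0] and W_V = [I_d, 0].
    """
    d = W_OV.shape[0]
    if d_h < d:
        raise ValueError("this construction needs d_h >= d")
    W_O = np.hstack([W_OV, np.zeros((d, d_h - d))])
    W_V = np.hstack([np.eye(d), np.zeros((d, d_h - d))])
    return W_O, W_V


rng = np.random.default_rng(0)
d, d_h = 5, 7
W_OV = rng.standard_normal((d, d))

W_O, W_V = pad_factorize(W_OV, d_h)
reconstruction_error = np.abs(W_O @ W_V.T - W_OV).max()

# With only 2 hidden dimensions the product has rank at most 2,
# so a generic d x d matrix is out of reach.
narrow = rng.standard_normal((d, 2)) @ rng.standard_normal((d, 2)).T
narrow_rank = np.linalg.matrix_rank(narrow)

print(reconstruction_error, narrow_rank)
```

The converse direction is immediate, since any pair of factors multiplies out to some $d \times d$ matrix. The two parameterizations therefore express the same set of functions, and any difference in OCR behavior has to come from training dynamics rather than from what each model can represent.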

Figures (7)

  • Figure 1: Illustration of the symbolic out-of-context reasoning (OCR) task. Left: The task is motivated by real-world knowledge injection, where $\mathcal{S}$ corresponds to names and $\mathcal{A}_1 = \{b_i\}_{i = 1}^n, \mathcal{A}_2= \{c_i\}_{i = 1}^n$ denote collections of cities and languages, respectively. Middle: We tokenize entities into symbolic sequences. Right: The mapping rule connects $\mathcal{S}$, $\mathcal{A}_1$, and $\mathcal{A}_2$, where each $s\in\mathcal{S}_i$ associates with a unique fact $b_i\in\mathcal{A}_1$ and corresponding implication $c_i\in\mathcal{A}_2$.
  • Figure 2: The weights and mechanisms of the trained one-layer attention models. The heatmaps on the left show that the factorized model (bottom) learns a structured weight matrix that enables OCR, as highlighted by the red box. The non-factorized model (top) fails to learn this structure. Here, the weights shown are the partial weights in the output-value matrix related to the prediction, i.e., we show a reduced matrix $\bm{W}_{\mathsf{O}\mathsf{V}} \in \mathbb{R}^{|\mathcal{A}| \times (mn + 2)}$. The diagram on the right illustrates how this structural difference leads to different outcomes. The task is to predict $c_2 \in \mathcal{A}_2$ given input $z_{1:T}$ with $(s_2, r_2)$, where the atomic knowledge $(s_2, r_2, c_2)$ is not included in the training set.
  • Figure 3: Training and Test Implication Loss for Factorized vs. Non-Factorized Models. While both models effectively minimize the training loss (left), their performance on unseen test implications differs starkly (right). The factorized model successfully generalizes, achieving low test implication loss and thus demonstrating OCR, while the non-factorized model fails to generalize.
  • Figure 4: Comparison of full weights of trained one-layer linear attention models. Left: Non-factorized model. Right: Factorized model. The factorized model shows strong OCR capability compared to the non-factorized model.
  • Figure 5: Comparison of solutions to \ref{eq:w-svm} and \ref{eq:ov-svm}. Top Left: \ref{eq:w-svm} with the Frobenius norm objective. Bottom Left: \ref{eq:ov-svm} with the nuclear norm objective. Here we only show the partial weights in the output-value matrix related to the prediction, i.e., $\bm{W}_{\mathsf{O}\mathsf{V}} \in \mathbb{R}^{|\mathcal{A}| \times (mn + 2)}$. Right: Geometric interpretation of $\bm{W}_\mathsf{O}$ and $\bm{W}_\mathsf{V}$ solved in \ref{eq:ov-svm}. All the subjects' feature vectors (corresponding to rows in $\bm{W}_\mathsf{V}$) reside in the $xy$ plane, while the relation vectors corresponding to $r_1$ and $r_2$ are orthogonal to the subjects and point in opposite directions. The predictions $\widehat{\bm{W}_\mathsf{O}}(b_i), \widehat{\bm{W}_\mathsf{O}}(c_i)$ are made by summing the feature vector of $s_i$ with $r_1$ or $r_2$; the sum aligns well with the features of $b_i$ or $c_i$, respectively (plotted in the figure as the corresponding rows in $\bm{W}_\mathsf{O}$), with cosine similarity greater than $0.9$.
  • ...and 2 more figures
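
The symbolic task described in the captions for Figures 1 and 2 can be sketched as a small dataset builder. This is a minimal illustration under stated assumptions, not the authors' data pipeline. The function name `build_ocr_task`, the token names (`s0_1`, `r1`, `b0`, and so on), and the choice to hold out one subject per group are all hypothetical. What it takes from the captions is that there are $n$ groups of $m$ subjects, that each subject in group $i$ maps to fact $b_i$ under $r_1$ and to implication $c_i$ under $r_2$, and that some $(s, r_2, c_i)$ triples are absent from training.

```python
import random


def build_ocr_task(n=3, m=4, held_out_per_group=1, seed=0):
    """Build (subject, relation, answer) triples for a toy OCR task.

    Every subject's fact (s, r1, b_i) is in the training set. The implication
    (s, r2, c_i) is in the training set only for subjects that are not held out;
    held-out implications form the test set.
    """
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n):
        fact, implication = f"b{i}", f"c{i}"
        subjects = [f"s{i}_{j}" for j in range(m)]
        held_out = set(rng.sample(subjects, held_out_per_group))
        for s in subjects:
            train.append((s, "r1", fact))
            if s in held_out:
                test.append((s, "r2", implication))
            else:
                train.append((s, "r2", implication))
    return train, test


train, test = build_ocr_task()
print(len(train), len(test))
for triple in test:
    print("held-out implication:", triple)
```

A model shows OCR on this task if it answers the held-out `r2` queries correctly. The only evidence available for them is that the held-out subject shares its fact $b_i$ with other subjects whose implication $c_i$ was seen during training.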

Theorems & Definitions (46)

  • Proposition 1: Equivalent expressivity for $(\bm{W}_\mathsf{O}, \bm{W}_\mathsf{V})$ and $\bm{W}_{\mathsf{O}\mathsf{V}}$
  • Remark 1
  • Theorem 1: SVM forms
  • Theorem 2: The OCR abilities of the factorized and non-factorized models
  • Theorem 3
  • proof: Proof of Proposition 1
  • proof: Proof of Theorem 1
  • Remark 2
  • Lemma 1
  • proof
  • ...and 36 more