Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers
Yixiao Huang, Hanlin Zhu, Tianyu Guo, Jiantao Jiao, Somayeh Sojoudi, Michael I. Jordan, Stuart Russell, Song Mei
TL;DR
The paper identifies out-of-context reasoning (OCR) as a single mechanism behind both generalization and hallucination when new facts are injected into LLMs through fine-tuning. It formalizes OCR as a synthetic factual-recall task and shows, through theoretical analysis of a one-layer attention-only transformer with factorized output and value matrices, that OCR arises from the implicit bias of gradient descent toward minimizing the nuclear norm of the combined output-value matrix. Because this bias associates facts with implications whether or not the underlying relation is causal, OCR yields strong generalization when the associated concepts are causally related and hallucination when they are not; in both cases the association is learned with high sample efficiency, and the behavior is validated on real-world data. The work provides a theoretical foundation for OCR, explains why pretraining knowledge can both help and mislead during fine-tuning, and points to future directions for mitigating OCR-related errors in deeper, multi-layer transformers while preserving desirable generalization.
Abstract
Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information. However, the reasons for this phenomenon remain poorly understood. In this work, we argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR): the ability to deduce implications by associating concepts, even those without a causal link. Our experiments across five prominent LLMs confirm that OCR indeed drives both generalization and hallucination, depending on whether the associated concepts are causally related. To build a rigorous theoretical understanding of this phenomenon, we then formalize OCR as a synthetic factual recall task. We empirically show that a one-layer single-head attention-only transformer with factorized output and value matrices can learn to solve this task, while a model with combined weights cannot, highlighting the crucial role of matrix factorization. Our theoretical analysis shows that the OCR capability can be attributed to the implicit bias of gradient descent, which favors solutions that minimize the nuclear norm of the combined output-value matrix. This mathematical structure explains why the model learns to associate facts and implications with high sample efficiency, regardless of whether the correlation is causal or merely spurious. Ultimately, our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
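The link between factorized weights and the nuclear norm can be illustrated with a standard identity from matrix analysis: among all factorizations W = A Bᵀ, the smallest value of ½(‖A‖_F² + ‖B‖_F²) equals the nuclear norm of W. The NumPy sketch below checks this numerically on a random low-rank matrix. It is only an illustration of that identity, not the paper's proof or training setup; the names `nuclear_norm`, `balanced_factors`, and `factor_cost` are hypothetical helpers introduced here, and `A`, `B` merely stand in for output and value factors.

```python
import numpy as np

def nuclear_norm(W):
    """Sum of singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def balanced_factors(W):
    """Return A, B with A @ B.T == W, built from the SVD of W.

    Each factor absorbs the square root of the singular values,
    which is the factorization that attains the minimum cost.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(s)
    A = U * root       # scale the columns of U
    B = Vt.T * root    # scale the columns of V
    return A, B

def factor_cost(A, B):
    """Half the sum of squared Frobenius norms of the two factors."""
    return 0.5 * (np.sum(A ** 2) + np.sum(B ** 2))

# Hypothetical stand-in for a combined output-value matrix (rank 3).
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))

A, B = balanced_factors(W)
assert np.allclose(A @ B.T, W)

balanced_cost = factor_cost(A, B)

# Rescaling the factors leaves the product W unchanged
# but makes the factorization more expensive.
c = 3.0
unbalanced_cost = factor_cost(A * c, B / c)

print("nuclear norm of W :", nuclear_norm(W))
print("balanced cost     :", balanced_cost)
print("unbalanced cost   :", unbalanced_cost)
```

The balanced cost matches the nuclear norm, while the rescaled factorization represents the same matrix at a higher cost. This gives some intuition for why a penalty or bias acting on separate factors can behave like a nuclear-norm preference on their product, whereas a single combined weight matrix has no such factor structure to exploit.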
