Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona

TL;DR

The paper tackles why large language models hallucinate by attributing failures to intrinsic architectural dynamics. It introduces Distributional Semantics Tracing (DST), a unified, layerwise framework that builds a causal semantic network and a Distributional Semantics Strength (DSS) metric to quantify the coherence of the contextual pathway and predict hallucinations. A dual-pathway (Associative vs Contextual) mechanism explains why fast, surface-level associations hijack slow, contextual reasoning, with a measurable instance of Reasoning Shortcut Hijack and a strong negative correlation ($\rho = -0.863$) between DSS and hallucination rate. Empirical results on Racing Thoughts and HALoGEN show DST yields higher faithfulness than baselines ($\text{avg faithfulness} \approx 0.71$–$0.79$) and supports proactive interventions to improve reliability in transformers.

Abstract

Large Language Models (LLMs) are prone to hallucination, the generation of plausible yet factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions. First, to enable the reliable tracing of internal semantic failures, we propose Distributional Semantics Tracing (DST), a unified framework that integrates established interpretability techniques to produce a causal map of a model's reasoning, treating meaning as a function of context (distributional semantics). Second, we pinpoint the model's layer at which a hallucination becomes inevitable, identifying a specific commitment layer where a model's internal representations irreversibly diverge from factuality. Third, we identify the underlying mechanism for these failures. We observe a conflict between distinct computational pathways, which we interpret using the lens of dual-process theory: a fast, heuristic associative pathway (akin to System 1) and a slow, deliberate, contextual pathway (akin to System 2), leading to predictable failure modes such as Reasoning Shortcut Hijacks. Our framework's ability to quantify the coherence of the contextual pathway reveals a strong negative correlation ($\rho = -0.863$) with hallucination rates, implying that these failures are predictable consequences of internal semantic weakness. The result is a mechanistic account of how, when, and why hallucinations occur within the Transformer architecture.

Paper Structure

This paper contains 41 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: A layer-wise view of how a hallucination unfolds inside a Large Language Model. The graph tracks the confidence of the model (Olmo 2), identifying three critical stages: the prediction onset (green dot), the semantic inversion point (yellow dot), and the commitment layer (red dot), an irreversible point of no return. The expansion diagrams for each of the above stages are semantic networks that visualise this process. This paper introduces Distributional Semantics Tracing (Section \ref{TSF}) to trace this semantic drift from its origin to the final architectural failure.
  • Figure 2: The Distributional Semantics Tracing (DST) framework. It integrates signals from concept importance, patched representations, and subsequence tracing to build a semantic network that reveals the conceptual relationships driving a prediction.
  • Figure 3: Distributional Semantics Tracing (DST) exposes the final-layer semantic network for (a) a correct response and (b) a hallucination, where a spurious association corrupts the reasoning.
  • Figure 4: Layer-wise analysis of a reasoning failure for Qwen 3. This shows the progression from prediction onset (green) to the semantic inversion point (yellow) and the irreversible commitment layer (red).
  • Figure 5: The relationship between a model's internal reasoning coherence and its tendency to hallucinate. The x-axis shows the Distributional Semantics Strength (DSS), a metric we introduce to quantify the stability of a model's contextual pathway. The y-axis shows the corresponding Hallucination Rate. The data reveals a strong negative linear relationship, with a Pearson correlation coefficient ($\rho$) of -0.863 and an R-squared value of 0.746. This provides strong quantitative evidence that a less coherent contextual pathway (lower DSS) is a primary and predictable vulnerability leading to higher rates of hallucination.
  • ...and 3 more figures
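
The caption of Figure 5 reports both a Pearson correlation coefficient and an R-squared value. For a simple least-squares line with an intercept, R-squared equals the square of the Pearson coefficient, which is why the two numbers are reported together. The following is a minimal Python sketch of that relationship. The `dss` and `hallucination_rate` lists are hypothetical placeholder values, not the paper's data, and the function names are illustrative assumptions rather than anything defined by the authors.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def linear_fit_r_squared(xs, ys):
    """R-squared of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares for the fitted line.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Hypothetical placeholder values (NOT taken from the paper).
dss = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
hallucination_rate = [0.10, 0.18, 0.22, 0.35, 0.33, 0.48]

r = pearson_r(dss, hallucination_rate)
r2 = linear_fit_r_squared(dss, hallucination_rate)
print(f"Pearson r = {r:.3f}, R^2 = {r2:.3f}, r^2 = {r * r:.3f}")
```

A negative `r` here mirrors the direction described in the caption (lower DSS, higher hallucination rate), while `r2` measures how much of the variance a straight line explains regardless of sign.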