Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs

Edward Y. Chang

TL;DR

The paper argues that autoregressive training teaches models associational patterns without exposing interventional causal structure, which produces two failures: Rung Collapse and Aleatoric Entrenchment. It introduces Epistemic Regret Minimization (ERM), a causal belief revision objective, within a three-layer architecture that uses Physical Grounding to obtain interventional data, AGM-based updates to a dynamic causal graph, and domain-independent guards for cross-domain transfer. The authors prove convergence guarantees for ERM and asymptotic recovery of the true interventional distribution, and evaluate the approach on 1,360 causal-trap scenarios across six frontier LLMs. Rung Collapse persists even in reasoning-enhanced models, advanced models resist generic correction (inverse scaling of steerability), and targeted epistemic feedback yields substantial gains where outcome-level feedback fails. The results suggest that safe AI requires explicit engagement with causal reasoning, using external causal models and epistemic signals rather than relying solely on outcome-based improvements, with practical implications for building robust tool-using agents and for the governance of autonomous systems.

Abstract

Machine learning systems that are "right for the wrong reasons" achieve high performance through shortcuts that collapse under distributional shift. We show this pathology has a precise causal origin: autoregressive training provides no gradient signal to distinguish association P(Y|X) from intervention P(Y|do(X)), a failure we formalize as Rung Collapse. When outcome-based learning reinforces correct answers obtained through incorrect causal models, the agent becomes entrenched in flawed reasoning, a phenomenon we term Aleatoric Entrenchment. We propose Epistemic Regret Minimization (ERM), a belief revision objective that penalizes errors in causal reasoning independently of task success, and embed it within a three-layer architecture with three contributions grounded in knowledge representation: (1) a Physical Grounding Theorem proving that actions satisfying actuator independence implement valid do-operations, bridging action languages and do-calculus; (2) ERM as a causal belief revision operator satisfying AGM postulates, preventing entrenchment even when the agent succeeds for the wrong reasons; and (3) a failure mode taxonomy that classifies recurring reasoning errors and injects domain-independent guards, enabling cross-domain transfer. We prove asymptotic recovery of the true interventional distribution with finite-sample bounds. Experiments on 1,360 causal trap scenarios across six frontier LLMs reveal that Rung Collapse persists even in reasoning-enhanced models (3.7% for GPT-5.2), that steerability exhibits inverse scaling where advanced models resist generic correction, and that targeted ERM feedback recovers 53-59% of entrenched errors where outcome-level feedback fails.
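
The gap between association P(Y|X) and intervention P(Y|do(X)) can be made concrete with a small worked example. The sketch below is not from the paper: it assumes a hypothetical three-variable model in which a confounder Z drives both X and Y and X has no effect on Y at all, with probabilities chosen only for illustration. It computes both quantities exactly by enumeration.

```python
from itertools import product

# Hypothetical structural causal model (an assumption, not the paper's setup):
# Z -> X and Z -> Y, with no edge from X to Y.
P_Z = {0: 0.5, 1: 0.5}            # P(Z=z)
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}   # P(X=1 | Z=z)
P_Y1_GIVEN_Z = {0: 0.2, 1: 0.8}   # P(Y=1 | Z=z); Y does not depend on X


def bern(p_one, value):
    """Probability that a Bernoulli(p_one) variable takes `value`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(z, x, y):
    """Observational joint P(Z=z, X=x, Y=y)."""
    return P_Z[z] * bern(P_X1_GIVEN_Z[z], x) * bern(P_Y1_GIVEN_Z[z], y)


def p_y1_given_x(x):
    """Association: P(Y=1 | X=x), obtained by conditioning on X."""
    numerator = sum(joint(z, x, 1) for z in (0, 1))
    denominator = sum(joint(z, x, y) for z, y in product((0, 1), repeat=2))
    return numerator / denominator


def p_y1_do_x(x):
    """Intervention: P(Y=1 | do(X=x)).

    Setting X by fiat removes the Z -> X edge, so Z keeps its marginal
    distribution. Because Y ignores X here, the result is the same for any x.
    """
    return sum(P_Z[z] * P_Y1_GIVEN_Z[z] for z in (0, 1))


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(p_y1_given_x(1), 4))
    print("P(Y=1 | do(X=1)) =", round(p_y1_do_x(1), 4))
```

In this toy model, observing X=1 raises the probability of Y=1 because X carries information about Z, while forcing X=1 changes nothing. A learner that sees only samples from the observational joint has no data that separates the two readings, which is the situation the abstract describes.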

Paper Structure (115 sections, 8 theorems, 14 equations, 3 figures, 6 tables, 2 algorithms)

Key Result

Proposition 1

For any LLM $\mathcal{M}$ trained on observational corpus $\mathcal{D}$ via autoregressive loss, $\mathcal{M}$ cannot reliably compute $P(Y|\text{do}(X))$ for causal structures not explicitly represented in $\mathcal{D}$.
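
As an illustration that is not taken from the paper, assume a single observed variable $Z$ that satisfies the back-door criterion relative to $(X, Y)$. Under that assumption, the observational and interventional quantities differ only in how $Z$ is weighted:

$$
P(Y \mid X = x) = \sum_z P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x),
\qquad
P(Y \mid \text{do}(X = x)) = \sum_z P(Y \mid X = x, Z = z)\, P(Z = z).
$$

The two expressions coincide whenever $Z$ is independent of $X$. Otherwise, computing the right-hand quantity requires knowing which variables to adjust for, which is causal-structure information that the observational distribution alone does not supply.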

Figures (3)

  • Figure 1: Three-layer ERM architecture. Layer 1 revises $G_t$ (orange, domain-specific). Layer 2 updates $\mathcal{F}_t$ (red, cross-domain). Both are external to the frozen LLM. Layer 3 (routing, not shown) acts on persistent residual regret.
  • Figure 2: Rung Collapse Rates across model generations. Error bars represent 95% confidence intervals ($N=1{,}360$). The Future Frontier (GPT-5.2, blue) significantly reduces causal hallucinations, but is outperformed by the Anthropic Anomaly (Claude 3.5, red).
  • Figure 3: The Inverse Scaling of Steerability. As models transition from primitive to reasoning-heavy, their susceptibility to generic challenges plummets (gray line), revealing Epistemic Stubbornness. However, targeted Epistemic Regret Minimization (blue line) successfully breaks this entrenchment. Error bars represent 95% CIs.

Theorems & Definitions (19)

  • Definition 1: Rung Collapse
  • Proposition 1: LLM Rung Collapse
  • Definition 2: Aleatoric Success
  • Lemma 1: Modularity Under Physical Intervention
  • Theorem 1: Physical Grounding
  • Corollary 1: Confounder Immunity
  • Theorem 2: Prevention of Aleatoric Entrenchment
  • Theorem 3: Asymptotic L2 Recovery
  • proof
  • proof
  • ...and 9 more