Right for the Wrong Reasons: Epistemic Regret Minimization for Causal Rung Collapse in LLMs

Edward Y. Chang

TL;DR

The paper argues that autoregressive models learn associational patterns that do not reveal interventional causal structure, which leads to Rung Collapse and Aleatoric Entrenchment. It introduces Epistemic Regret Minimization (ERM), a causal belief revision objective, and a three-layer architecture that uses Physical Grounding to obtain interventional data, AGM-based updates to a dynamic causal graph, and domain-independent guards for cross-domain transfer. The paper proves convergence guarantees for ERM and asymptotic recovery of the true interventional distribution, and validates the approach on 1,360 causal-trap scenarios across six frontier LLMs. The experiments show persistent Rung Collapse, inverse steerability under generic corrections, and substantial gains under targeted epistemic feedback. The results imply that safe AI requires explicit engagement with causal reasoning, using external causal models and epistemic signals rather than relying solely on outcome-based improvements, with practical impact for building robust tool-using agents and for governance of autonomous systems.

Abstract

Machine learning systems that are "right for the wrong reasons" achieve high performance through shortcuts that collapse under distributional shift. We show this pathology has a precise causal origin: autoregressive training provides no gradient signal to distinguish association P(Y|X) from intervention P(Y|do(X)), a failure we formalize as Rung Collapse. When outcome-based learning reinforces correct answers obtained through incorrect causal models, the agent becomes entrenched in flawed reasoning, a phenomenon we term Aleatoric Entrenchment. We propose Epistemic Regret Minimization (ERM), a belief revision objective that penalizes errors in causal reasoning independently of task success, and embed it within a three-layer architecture with three contributions grounded in knowledge representation: (1) a Physical Grounding Theorem proving that actions satisfying actuator independence implement valid do-operations, bridging action languages and do-calculus; (2) ERM as a causal belief revision operator satisfying AGM postulates, preventing entrenchment even when the agent succeeds for the wrong reasons; and (3) a failure mode taxonomy that classifies recurring reasoning errors and injects domain-independent guards, enabling cross-domain transfer. We prove asymptotic recovery of the true interventional distribution with finite-sample bounds. Experiments on 1,360 causal trap scenarios across six frontier LLMs reveal that Rung Collapse persists even in reasoning-enhanced models (3.7% for GPT-5.2), that steerability exhibits inverse scaling where advanced models resist generic correction, and that targeted ERM feedback recovers 53-59% of entrenched errors where outcome-level feedback fails.
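The gap between association P(Y|X) and intervention P(Y|do(X)) can be made concrete with a small exact calculation. The sketch below is not from the paper: it assumes a hypothetical three-variable model in which a confounder Z drives both X and Y, and X has no causal effect on Y. All probabilities and function names are illustrative assumptions.

```python
# Hypothetical toy model (an assumption, not taken from the paper):
# Z -> X and Z -> Y, with no causal edge from X to Y.
P_Z = {0: 0.5, 1: 0.5}            # prior over the confounder Z
P_X1_GIVEN_Z = {0: 0.1, 1: 0.9}   # P(X=1 | Z=z)
# P(Y=1 | X=x, Z=z) depends only on z, so X is causally inert.
P_Y1_GIVEN_XZ = {(x, z): (0.8 if z == 1 else 0.2)
                 for x in (0, 1) for z in (0, 1)}


def p_x_given_z(x, z):
    """P(X=x | Z=z) for binary X."""
    p1 = P_X1_GIVEN_Z[z]
    return p1 if x == 1 else 1.0 - p1


def p_y1_given_x(x):
    """Associational quantity P(Y=1 | X=x): weights z by P(z | x)."""
    joint = sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    marginal = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return joint / marginal


def p_y1_do_x(x):
    """Interventional quantity P(Y=1 | do(X=x)): weights z by P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


if __name__ == "__main__":
    for x in (0, 1):
        print(x, round(p_y1_given_x(x), 4), round(p_y1_do_x(x), 4))
```

In this toy model the associational quantity moves with X while the interventional one does not, so a learner that only sees observational pairs (X, Y) could predict Y well and still hold the wrong causal model.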

Paper Structure (115 sections, 8 theorems, 14 equations, 3 figures, 6 tables, 2 algorithms)


Introduction
The causal root of shortcut learning.
Entrenchment.
Our approach.
Organization.
Related Work
Causal reasoning benchmarks for LLMs.
Shortcut learning and spurious correlations.
Causal bandits and causal RL.
Action languages and tool-using agents.
Belief revision and external knowledge structures.
Epistemic Regret Minimization (ERM)
Pearl's Hierarchy and Rung Collapse
Physical Grounding
The ERM Objective and Convergence
...and 100 more sections

Key Result

Proposition 1

For any LLM $\mathcal{M}$ trained on observational corpus $\mathcal{D}$ via autoregressive loss, $\mathcal{M}$ cannot reliably compute $P(Y|\text{do}(X))$ for causal structures not explicitly represented in $\mathcal{D}$.
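One way to see why observational data alone may not pin down the interventional quantity is the standard back-door adjustment identity. It is a textbook do-calculus result rather than something stated in this summary, and it assumes a set of variables $Z$ that satisfies the back-door criterion relative to $(X, Y)$:

$$
P(Y \mid \text{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z),
\qquad
P(Y \mid X = x) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z \mid X = x).
$$

The two expressions differ only in how $z$ is weighted, by $P(z)$ versus $P(z \mid x)$. Choosing the right weighting requires knowing which variables play the role of $Z$, which is information about the causal structure and not about the observed joint distribution.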

Figures (3)

Figure 1: Three-layer ERM architecture. Layer 1 revises $G_t$ (orange, domain-specific). Layer 2 updates $\mathcal{F}_t$ (red, cross-domain). Both are external to the frozen LLM. Layer 3 (routing, not shown) acts on persistent residual regret.
Figure 2: Rung Collapse Rates across model generations. Error bars represent 95% confidence intervals ($N=1{,}360$). The Future Frontier (GPT-5.2, blue) significantly reduces causal hallucinations, but is outperformed by the Anthropic Anomaly (Claude 3.5, red).
Figure 3: The Inverse Scaling of Steerability. As models transition from primitive to reasoning-heavy, their susceptibility to generic challenges plummets (gray line), revealing Epistemic Stubbornness. However, targeted Epistemic Regret Minimization (blue line) successfully breaks this entrenchment. Error bars represent 95% CIs.

Theorems & Definitions (19)

Definition 1: Rung Collapse
Proposition 1: LLM Rung Collapse
Definition 2: Aleatoric Success
Lemma 1: Modularity Under Physical Intervention
Theorem 1: Physical Grounding
Corollary 1: Confounder Immunity
Theorem 2: Prevention of Aleatoric Entrenchment
Theorem 3: Asymptotic L2 Recovery
proof
proof
...and 9 more
