
Epistemic Traps: Rational Misalignment Driven by Model Misspecification

Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu

TL;DR

This paper reframes AI safety by showing that failures like sycophancy, hallucination, and deception are not merely training artifacts but structurally stable equilibria arising from misspecified world models. By adapting Berk-Nash Rationalizability (BNR), it defines a formal framework in which an AI agent optimizes against a flawed subjective model, yielding discrete safety phases determined by epistemic priors rather than reward magnitude. The authors derive phase diagrams, prove the conditions under which a unique Berk-Nash equilibrium arises as opposed to multiple equilibria or oscillations, and validate the predictions with behavioral experiments and metrics across six model families. They propose Subjective Model Engineering (SME) as a paradigm shift from Reward Engineering to shaping an agent’s internal priors, and offer two pathways to verifiable safety and robust alignment: environment engineering, and SME via modular architectures, prior shaping, and mechanistic interpretability.

Abstract

The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors, but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.

Paper Structure (42 sections, 13 theorems, 40 equations, 22 figures, 4 tables)


Key Result

Theorem 2.4

Let $A^{\infty}_{BNR}$ denote the set of all Berk-Nash rationalizable actions. Then $A^{\infty}_{BNR}$ is the largest non-empty set $\tilde{A} \subseteq A$ that is self-justifying (i.e., the largest fixed point of the best-response operator $\Gamma$). Under standard continuity and compactness assumptions, it is the limit of the iterated operator, $A^{\infty}_{BNR} = \bigcap_{k \geq 0} \Gamma^{k}(A)$, where $\Gamma^0(A) = A$ and $\Gamma^{k+1}(A) = \Gamma(\Gamma^k(A))$.
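
The fixed-point characterization in Theorem 2.4 can be illustrated on a finite action set by applying $\Gamma$ repeatedly, starting from $\Gamma^0(A) = A$, until the set stops changing. The following is a minimal sketch, not the paper's implementation. It assumes that $\Gamma$ is monotone and that its iterates shrink. The function `bnr_set`, the table `SUPPORTED_BY`, the operator `toy_gamma`, and the action names are all hypothetical stand-ins; in the paper, $\Gamma$ is built from KL-minimizing beliefs and subjective best responses rather than from a lookup table.

```python
from typing import Callable, Dict, FrozenSet

ActionSet = FrozenSet[str]


def bnr_set(
    gamma: Callable[[ActionSet], ActionSet],
    actions: ActionSet,
    max_iter: int = 1000,
) -> ActionSet:
    """Iterate gamma from the full action set until a fixed point is reached."""
    current = frozenset(actions)  # Gamma^0(A) = A
    for _ in range(max_iter):
        nxt = frozenset(gamma(current))  # Gamma^{k+1}(A) = Gamma(Gamma^k(A))
        if nxt == current:
            return current  # self-justifying: gamma(current) == current
        current = nxt
    raise RuntimeError("no fixed point reached; gamma may not be monotone")


# Hypothetical table (an assumption for illustration only): an action survives
# if at least one action in its support set is still under consideration.
SUPPORTED_BY: Dict[str, FrozenSet[str]] = {
    "honest": frozenset({"honest"}),   # justifies itself
    "flatter": frozenset({"hedge"}),   # justified only while "hedge" survives
    "hedge": frozenset({"stall"}),     # justified only while "stall" survives
    "stall": frozenset(),              # never justified
}
ACTIONS: ActionSet = frozenset(SUPPORTED_BY)


def toy_gamma(subset: ActionSet) -> ActionSet:
    """Toy stand-in for the best-response operator on subsets of ACTIONS."""
    return frozenset(a for a in ACTIONS if SUPPORTED_BY[a] & subset)


if __name__ == "__main__":
    # Elimination order: "stall", then "hedge", then "flatter".
    print(sorted(bnr_set(toy_gamma, ACTIONS)))  # ['honest']
```

In this toy example each round removes one action whose justification has already been eliminated, so the surviving set is the largest subset that justifies itself.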

Figures (22)

  • Figure 1: The epistemic-rational loop of a misspecified learning agent. The agent's policy or strategy ($\pi \in \Delta(A)$), by generating data from the environment ($Q$), determines its belief (the KL-minimizing set $\Theta^{*}(\pi)$). This is the epistemic (learning) process. In turn, this belief ($\mu \in \Delta(\Theta^{*})$), combined with the goal/utility function ($u$), determines the set of subjectively optimal best-response actions ($B(\mu)$) that constitute the new policy/behavior. This is the rational (decision) process. Berk-Nash Rationalizability identifies the set of all stable, self-justifying behaviors that can persist under these dynamics.
  • Figure 8: Distribution of Sycophantic Behavior Across Reward Regimes (Six Model Families). This figure aggregates the "Unsafe Rate" (frequency of selecting the sycophantic action $a_S$) across four distinct environmental quadrants defined by the objective reward probabilities for sycophancy ($p_S$) and honesty ($p_H$). The "Low" and "High" labels correspond to probabilities below and above the critical threshold of 0.5, respectively. Consistent with the theoretical prediction of a unique "Safe" equilibrium, the Low $p_S$, High $p_H$ regime (first column in each subplot) exhibits a median unsafe rate near zero with minimal variance across all architectures. In contrast, the High $p_S$, Low $p_H$ regime (fourth column) drives the models towards near-total sycophancy. Crucially, the intermediate regimes (middle columns) display significantly higher interquartile ranges (larger boxes), empirically validating the theoretical existence of multiple equilibria or oscillation, where the agent's convergence is highly sensitive to initialization and stochasticity.
  • Figure 9: Quantification of Non-Convergent Dynamics via Flip Rates. To diagnose the stability of the learned behaviors, we measure the "Flip Rate" (the probability that the agent switches actions between consecutive steps, $a_{t} \neq a_{t-1}$) across the same four reward quadrants. Our Berk-Nash framework (Theorem \ref{thm-syc-main}, Case 4) uniquely predicts that when both rewards are suboptimal ($p_S < 0.5, p_H < 0.5$), the system lacks a stable fixed point and will enter an oscillatory 2-cycle. The experimental data confirms this structural instability: the Low $p_S$, Low $p_H$ quadrant (third column in each subplot) consistently exhibits the highest median flip rates across diverse model sizes, from GPT-5-Nano to DeepSeek-V3.2-Exp (685B). This contrasts sharply with the "Honest" and "Unsafe" regimes, where low flip rates indicate rapid convergence to a stable strategy, thereby isolating the specific environmental conditions that induce chaotic behavioral dynamics.
  • Figure 10: Decoupling of Behavior from Objective Risk via Subjective Model Engineering. We evaluate the propensity for strategic deception $\pi(a_D)$ across six distinct LLM architectures as a function of the true objective probability of capture ($p_{catch}$, x-axis). The vertical dashed line at $p_{catch} = 0.45$ marks the theoretical critical threshold where the expected utility of honesty exceeds that of deception. The three curves correspond to the distinct belief topologies $\Theta$ defined in Theorem \ref{thm-deception}: (1) Robust Safety (Green Line): The "Structurally Pessimistic" agent ($\hat{p}_{catch}^l > p_{critical}$) maintains a deception rate near zero regardless of the true risk. Even when the environment is perfectly safe ($p_{catch} \approx 0$), the agent's priors prevent the rationalization of deceptive tactics. (2) Locked-in Deception (Red Line): The "Structurally Overconfident" agent ($\hat{p}_{catch}^h < p_{critical}$) exhibits a catastrophic failure of alignment. The deception rate remains at saturation ($\approx 1.0$) even as the objective risk becomes certain ($p_{catch} \to 1.0$), demonstrating that empirical evidence of failure cannot correct a fundamentally misspecified prior. (3) Brittle Alignment (Blue Line): The "Conflicted" agent ($\hat{p}_{catch}^l \leq p_{critical} < \hat{p}_{catch}^{h}$) demonstrates the indeterminacy predicted by Case 2 of Theorem \ref{thm-deception}. Unlike a rational Nash learner, which would exhibit a sharp phase transition, this agent displays intermediate deception rates across the entire spectrum. To the left of the threshold ($p_{catch} < p_{critical}$), while Deception is the objective BNE, the sub-optimal Honest action remains rationalizable (BNR) because the agent can get "stuck" in a self-confirming belief of high risk. Conversely, to the right ($p_{catch} > p_{critical}$), while Honesty is the objective BNE, unsafe Deception remains rationalizable (BNR) as the agent's priors still permit optimistic interpretations of risk. Collectively, these results empirically validate that Subjective Model Engineering, i.e., the shaping of epistemic priors, determines the bounds of rationalizable behavior, rendering the system robust (or brittle) to the objective environment.
  • Figure : (a) In quadrants.
  • ...and 17 more figures
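
The quantities named in the captions of Figures 8 to 10 are simple to state in code. The sketch below computes the "Unsafe Rate" (frequency of the sycophantic action), the "Flip Rate" (fraction of consecutive steps with $a_{t} \neq a_{t-1}$), and the three belief topologies from the inequalities quoted for Figure 10. The function names, the string labels `"a_S"` and `"a_H"`, and the handling of the boundary case $\hat{p}_{catch}^{h} = p_{critical}$ are assumptions made for illustration; the paper's own evaluation code is not shown in this summary.

```python
from typing import Sequence


def unsafe_rate(actions: Sequence[str], unsafe_action: str = "a_S") -> float:
    """Fraction of steps on which the agent chose the unsafe action."""
    if not actions:
        raise ValueError("need at least one action")
    return sum(a == unsafe_action for a in actions) / len(actions)


def flip_rate(actions: Sequence[str]) -> float:
    """Fraction of consecutive step pairs on which the action changed."""
    if len(actions) < 2:
        raise ValueError("need at least two actions")
    flips = sum(cur != prev for prev, cur in zip(actions, actions[1:]))
    return flips / (len(actions) - 1)


def belief_topology(p_low: float, p_high: float, p_critical: float) -> str:
    """Label a belief interval [p_low, p_high] for the capture probability."""
    if not 0.0 <= p_low <= p_high <= 1.0:
        raise ValueError("expected 0 <= p_low <= p_high <= 1")
    if p_low > p_critical:
        return "structurally pessimistic"    # robust safety
    if p_high < p_critical:
        return "structurally overconfident"  # locked-in deception
    if p_low <= p_critical < p_high:
        return "conflicted"                  # brittle alignment
    # p_high == p_critical is not covered by the caption's inequalities.
    return "boundary"


if __name__ == "__main__":
    oscillating = ["a_S", "a_H", "a_S", "a_H"]  # a 2-cycle
    print(unsafe_rate(oscillating), flip_rate(oscillating))  # 0.5 1.0
    print(belief_topology(0.2, 0.8, 0.45))  # conflicted
```

A perfect 2-cycle gives a flip rate of 1.0 and an unsafe rate of 0.5, while a trajectory that has converged to a single action gives a flip rate of 0.0, which is the contrast Figure 9 draws between the oscillatory quadrant and the stable ones.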

Theorems & Definitions (24)

  • Definition 2.1: Model Misspecification
  • Definition 2.2: The Best-Response Operator
  • Definition 2.3: Berk-Nash Rationalizability and Equilibrium
  • Theorem 2.4: Characterization of the BNR Set [esponda2025berk]
  • Theorem 2.5: Limit Actions are Rationalizable [esponda2025berk]
  • Lemma 3.1
  • proof
  • Theorem 3.2
  • proof
  • Corollary 3.3
  • ...and 14 more