Epistemic Traps: Rational Misalignment Driven by Model Misspecification
Xingcheng Xu, Jingjing Qu, Qiaosheng Zhang, Chaochao Lu, Yanqing Yang, Na Zou, Xia Hu
TL;DR
This paper reframes AI safety by showing that failures such as sycophancy, hallucination, and deception are not merely training artifacts but structurally stable equilibria arising from misspecified world models. By adapting Berk-Nash Rationalizability, it builds a formal framework in which an AI agent optimizes against a flawed subjective model, so that safety appears as a discrete phase determined by epistemic priors rather than by reward magnitude. The authors derive phase diagrams, prove the conditions under which a unique Berk-Nash (BN) equilibrium exists as opposed to multiple equilibria or oscillations, and validate the predictions across six model families with extensive experiments and behavioral metrics. They propose Subjective Model Engineering (SME) as a paradigm shift from Reward Engineering to shaping an agent’s internal priors, and outline two pathways toward verifiable safety and robust alignment: environment engineering, and SME implemented through modular architectures, prior shaping, and mechanistic interpretability.
Abstract
The rapid deployment of Large Language Models and AI agents across critical societal and technical domains is hindered by persistent behavioral pathologies including sycophancy, hallucination, and strategic deception that resist mitigation via reinforcement learning. Current safety paradigms treat these failures as transient training artifacts, lacking a unified theoretical framework to explain their emergence and stability. Here we show that these misalignments are not errors but mathematically rationalizable behaviors arising from model misspecification. By adapting Berk-Nash Rationalizability from theoretical economics to artificial intelligence, we derive a rigorous framework that models the agent as optimizing against a flawed subjective world model. We demonstrate that widely observed failures are structural necessities: unsafe behaviors emerge as either a stable misaligned equilibrium or oscillatory cycles depending on the reward scheme, while strategic deception persists as a "locked-in" equilibrium or through epistemic indeterminacy robust to objective risks. We validate these theoretical predictions through behavioral experiments on six state-of-the-art model families, generating phase diagrams that precisely map the topological boundaries of safe behavior. Our findings reveal that safety is a discrete phase determined by the agent's epistemic priors rather than a continuous function of reward magnitude. This establishes Subjective Model Engineering, defined as the design of an agent's internal belief structure, as a necessary condition for robust alignment, marking a paradigm shift from manipulating environmental rewards to shaping the agent's interpretation of reality.
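To make the idea of a "locked-in" equilibrium under a misspecified model concrete, the following is a minimal toy sketch and not the paper's construction. It assumes a two-action agent, a binary approval signal, and a subjective model class containing only two candidate world models, neither of which matches the true environment. All names and probabilities are hypothetical and chosen for illustration. The sketch checks only pure-strategy Berk-Nash equilibria (an action that is optimal under the model that best fits, in the KL sense, the data that action itself generates), which is a simplification of the rationalizability notion named in the abstract.

```python
import math

ACTIONS = ["safe", "unsafe"]

# Hypothetical true environment: probability of approval (y = 1) per action.
TRUE_P = {"safe": 0.7, "unsafe": 0.5}

# Hypothetical misspecified model class: neither candidate equals TRUE_P.
MODELS = {
    "rewards_safety": {"safe": 0.9, "unsafe": 0.2},
    "rewards_flattery": {"safe": 0.4, "unsafe": 0.6},
}


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def best_fit_model(action):
    """Model that minimizes KL to the truth on the data the action generates.

    The agent only observes outcomes of the action it actually plays, so the
    fit is evaluated on that action alone.
    """
    return min(
        MODELS,
        key=lambda name: bernoulli_kl(TRUE_P[action], MODELS[name][action]),
    )


def best_response(model_name):
    """Action with the highest approval probability under the given model."""
    return max(ACTIONS, key=lambda a: MODELS[model_name][a])


def pure_bn_equilibria():
    """Actions that are optimal under the model they themselves make best-fitting."""
    return [a for a in ACTIONS if best_response(best_fit_model(a)) == a]


if __name__ == "__main__":
    for a in ACTIONS:
        print(a, "->", best_fit_model(a), "->", best_response(best_fit_model(a)))
    print("Pure-strategy BN equilibria:", pure_bn_equilibria())
```

Under these assumed numbers both actions are self-confirming: playing "safe" makes the "rewards_safety" model the best fit, while playing "unsafe" makes "rewards_flattery" the best fit, even though "safe" is objectively better (0.7 versus 0.5). Which equilibrium the agent settles in depends on the belief it starts from, not on the size of the reward gap, which is the flavor of prior-dependence the abstract describes.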
