Limitations of Agents Simulated by Predictive Models
Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna
TL;DR
The paper examines why predictive models repurposed as agents can fail because of causal confounding, and identifies two core failure modes: auto-suggestive delusions and predictor-policy incoherence. It formalizes predictor-policy incoherence and proves that iteratively re-training the model on its own actions can drive the policy toward optimality, while also showing that delusions can be eliminated by environment-informed fine-tuning. In small-scale experiments with Decision Transformers on simple games (Padlock and Tic-Tac-Toe), the authors demonstrate that online fine-tuning on self-generated data reduces both failure modes, bringing the simulated agent's behavior in line with the underlying task structure. The work offers a unifying framework for offline-to-online policy improvement, with implications for RLHF and self-training approaches, and points to ways of supporting scalable oversight by controlling agentive capabilities.
Abstract
There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model's implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both failures are fixed by including a feedback loop from the environment, that is, by re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of these failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
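To make predictor-policy incoherence concrete, the following is a minimal sketch on a hypothetical two-step game. The game, the probabilities, and all function names are assumptions made for illustration; they are not the paper's Padlock or Tic-Tac-Toe experiments. At step 1 the agent picks "safe" (wins with probability 0.6) or "risky"; after "risky", a second action "good" wins and "bad" loses. The simulated agent is obtained by conditioning an exact predictive model of the data policy on a win, and re-training is modeled as repeating that conditioning on the agent's own behavior.

```python
# Hypothetical two-step game, chosen for illustration (not from the paper).
P_WIN_SAFE = 0.6                         # "safe" wins with this probability; episode ends
P_WIN_STEP2 = {"good": 1.0, "bad": 0.0}  # after "risky", the second action decides the outcome


def win_prob_after_risky(p_good):
    """Win probability after "risky" if step 2 plays "good" with probability p_good."""
    return p_good * P_WIN_STEP2["good"] + (1 - p_good) * P_WIN_STEP2["bad"]


def condition_on_win(policy):
    """Policy simulated by a predictor of `policy` when conditioned on a win.

    policy = (p_risky, p_good): probability of "risky" at step 1 and of
    "good" at step 2 under the policy that generated the training data.
    """
    p_risky, p_good = policy
    win_risky = win_prob_after_risky(p_good)
    # Bayes' rule at step 1: P(a1 | win) is proportional to P(a1) * P(win | a1),
    # where P(win | risky) assumes step 2 is played as in the training data.
    w_risky = p_risky * win_risky
    w_safe = (1 - p_risky) * P_WIN_SAFE
    new_p_risky = w_risky / (w_risky + w_safe)
    # Bayes' rule at step 2: P(good | risky, win).
    new_p_good = p_good * P_WIN_STEP2["good"] / win_risky
    return new_p_risky, new_p_good


def true_win_rate(policy):
    """Actual win rate when `policy` is executed in the game."""
    p_risky, p_good = policy
    return p_risky * win_prob_after_risky(p_good) + (1 - p_risky) * P_WIN_SAFE


def iterate(policy, rounds):
    """Repeatedly re-train on the agent's own actions, then condition on a win again."""
    rates = []
    for _ in range(rounds):
        policy = condition_on_win(policy)
        rates.append(true_win_rate(policy))
    return policy, rates


data_policy = (0.5, 0.5)  # assumed data policy: uniformly random at both steps
final_policy, win_rates = iterate(data_policy, rounds=10)
```

Under these assumptions, the first simulated agent always plays "good" at step 2, yet still puts more than half of its step-1 probability on "safe", because the predictor evaluates "risky" as if step 2 would be played randomly, as in the data. Its win rate is therefore below the optimum of 1.0 that "risky" followed by "good" would achieve. Each further round of conditioning on the agent's own behavior shifts more probability to "risky", so `win_rates` increases toward 1.0.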
