Limitations of Agents Simulated by Predictive Models

Raymond Douglas, Jacek Karwowski, Chan Bae, Andis Draguns, Victoria Krakovna

TL;DR

The paper examines why predictive models repurposed as agents can fail because of causal confounding, and identifies two failure modes: auto-suggestive delusions and predictor-policy incoherence. It formalizes predictor-policy incoherence and proves that iteratively re-training the model on its own actions makes the policy converge to an optimal one; it also shows that delusions can be eliminated by fine-tuning with feedback from the environment. In small-scale experiments with Decision Transformers on simple games (Padlock and Tic-Tac-Toe), the authors demonstrate that online fine-tuning on self-generated data reduces both failure modes, bringing the simulated agent's behavior in line with the underlying task structure. The work offers a unifying view of offline-to-online policy improvement, with implications for RLHF and self-training approaches, and points to controlling agentic capabilities as a route toward scalable oversight.

Abstract

There is increasing focus on adapting predictive models into agent-like systems, most notably AI assistants based on language models. We outline two structural reasons for why these models can fail when turned into agents. First, we discuss auto-suggestive delusions. Prior work has shown theoretically that models fail to imitate agents that generated the training data if the agents relied on hidden observations: the hidden observations act as confounding variables, and the models treat actions they generate as evidence for nonexistent observations. Second, we introduce and formally study a related, novel limitation: predictor-policy incoherence. When a model generates a sequence of actions, the model's implicit prediction of the policy that generated those actions can serve as a confounding variable. The result is that models choose actions as if they expect future actions to be suboptimal, causing them to be overly conservative. We show that both of those failures are fixed by including a feedback loop from the environment, that is, re-training the models on their own actions. We give simple demonstrations of both limitations using Decision Transformers and confirm that empirical results agree with our conceptual and formal analysis. Our treatment provides a unifying view of those failure modes, and informs the question of why fine-tuning offline learned policies with online learning makes them more effective.
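
The confounding behind auto-suggestive delusions can be illustrated with a minimal sketch. The toy below is hypothetical and is not taken from the paper: a hidden bit `h` is uniform, the expert sees `h` and plays `a = h`, and the reward is 1 exactly when the action matches the hidden bit. The function names and the setup are assumptions made for illustration. A perfect predictor of the logged data reports certain success for either action, while an agent that picks the action without seeing `h` succeeds only half the time.

```python
from fractions import Fraction

HIDDEN_VALUES = (0, 1)  # hypothetical hidden observation, uniform over both values


def reward(h, a):
    """Reward is 1 exactly when the action matches the hidden bit."""
    return 1 if a == h else 0


def expert_dataset_distribution():
    """Joint distribution over (action, reward) in the expert data; h is NOT logged."""
    joint = {}
    for h in HIDDEN_VALUES:
        a = h  # the expert acts on the hidden observation
        key = (a, reward(h, a))
        joint[key] = joint.get(key, Fraction(0)) + Fraction(1, len(HIDDEN_VALUES))
    return joint


def predicted_success(joint, a):
    """P(R=1 | A=a) as a perfect predictor of the dataset would report it."""
    total = sum(p for (a_, _), p in joint.items() if a_ == a)
    wins = sum(p for (a_, r), p in joint.items() if a_ == a and r == 1)
    return wins / total


def actual_success(a):
    """P(R=1 | do(A=a)): the agent chooses a without access to h."""
    return sum(Fraction(1, len(HIDDEN_VALUES)) * reward(h, a) for h in HIDDEN_VALUES)


if __name__ == "__main__":
    joint = expert_dataset_distribution()
    for a in (0, 1):
        print(a, predicted_success(joint, a), actual_success(a))
```

The gap between `predicted_success` and `actual_success` is the gap between conditioning on an action and intervening on it; the model treats its own action as evidence about a hidden bit it never observed.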

Paper Structure (32 sections, 3 theorems, 28 equations, 10 figures, 2 tables)

Key Result

Theorem 3.6

There exists some optimal policy $\pi^*(a|s)$ such that the sequence of policies $\pi_k(a_t|s_t)$ converges to $\pi^*$ in KL-distance, where $\pi_k = \mathcal{G}(\pi_{k-1}) = \mathcal{G}^{k}(\pi_0)$.
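
The iteration in the theorem can be sketched on a toy problem. The definition of $\mathcal{G}$ is not reproduced in this summary, so the code below assumes one plausible reading: each round re-fits the policy to reward-conditioned rollouts of the previous policy, i.e. $\pi_{k+1}(a|s) \propto \pi_k(a|s)\, P_{\pi_k}(R=1 \mid s, a)$. The two-step game, its payoffs, and all function names are hypothetical. A "safe" first move pays off with probability 0.6; a "risky" first move leads to a second choice among three options, only one of which pays off.

```python
import math


def retrain(pi_first, pi_second, p_safe, p_options):
    """One round of the assumed update pi_{k+1}(a|s) ∝ pi_k(a|s) * P_{pi_k}(R=1 | s, a)."""
    # Success probability of the risky branch when the CURRENT policy plays step two.
    v_risky = sum(p * r for p, r in zip(pi_second, p_options))
    w_safe = pi_first["safe"] * p_safe
    w_risky = pi_first["risky"] * v_risky
    z = w_safe + w_risky
    new_first = {"safe": w_safe / z, "risky": w_risky / z}
    # Second-step policy is reweighted by each option's success probability.
    w = [p * r for p, r in zip(pi_second, p_options)]
    total = sum(w)
    new_second = [x / total for x in w]
    return new_first, new_second


def iterate(rounds, p_safe=0.6, p_options=(1.0, 0.0, 0.0)):
    """Return P(risky) at the first step after 0, 1, ..., rounds updates."""
    first = {"safe": 0.5, "risky": 0.5}   # uniform starting policy (assumption)
    second = [1 / 3, 1 / 3, 1 / 3]
    history = [first["risky"]]
    for _ in range(rounds):
        first, second = retrain(first, second, p_safe, p_options)
        history.append(first["risky"])
    return history


def kl_to_optimal(p_risky):
    """KL(pi* || pi_k) at the first step, where pi* plays 'risky' with probability 1."""
    return -math.log(p_risky)


if __name__ == "__main__":
    for k, p in enumerate(iterate(30)):
        print(k, round(p, 4), round(kl_to_optimal(p), 4))
```

In this toy, the first update lowers the probability of the risky move, because the risky branch is evaluated under the still-uniform second-step policy; this mirrors the overly conservative behavior the abstract attributes to predictor-policy incoherence. Later updates raise it toward 1 once the second-step policy has improved.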

Figures (10)

  • Figure 1: Agents derived from predictive models might fail because of causal confounding, but fine-tuning on their own output addresses those issues.
  • Figure 2: The analogies and differences between two modes of causal confounding.
  • Figure 3: A diagram of the Stock trader example. States are represented in blue; observations are in orange; lines represent actions and associated transitions. Dotted lines are transitions which the expert is never observed to take. The agent simulated by the model then falls into auto-suggestive delusion: since it doesn't distinguish between the starting states, it believes it gets profit no matter what action it takes.
  • Figure 4: Representation of possible trajectories in the three cards game. The bottom row represents the probability of reward $P(R)$ in each end state. The bolded line is the optimal strategy.
  • Figure 5: The Padlock game involves an agent trying to open a padlock with a 15-bit code (for presentation, we limit this to 5 bits here). The first three bits are always the same; the last two are randomised. The agent learns offline from a dataset of plays by experts, who know the combination and get it right on the first try. The agent correctly sets the first three bits; however, it does not know the combination and only deludes itself into thinking that it has solved the game.
  • ...and 5 more figures
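
The Padlock caption and the re-training fix can be illustrated together with a rough sketch. This is not the paper's Decision Transformer setup: a count-based estimate stands in for the predictive model, only the randomised tail of the code is simulated, and the sizes (`N_RANDOM_BITS`, `n_expert`, `n_self`) and function names are assumptions chosen for illustration. Expert episodes always open the lock, so the predictor initially reports certain success; adding the agent's own episodes, labelled by the environment, pulls the estimate toward the true chance of guessing the tail.

```python
import random

N_RANDOM_BITS = 2  # hypothetical size of the randomised tail of the code


def random_bits(rng):
    """Draw a uniformly random tail of N_RANDOM_BITS bits."""
    return tuple(rng.randint(0, 1) for _ in range(N_RANDOM_BITS))


def play(guess, rng):
    """Environment: draws a fresh secret tail and reports whether the guess opens the lock."""
    secret = random_bits(rng)
    return int(guess == secret)


def predicted_open_rate(data):
    """Count-based stand-in for the predictive model: fraction of logged episodes that opened."""
    return sum(opened for _, opened in data) / len(data)


def run(n_expert=100, n_self=2000, seed=0):
    rng = random.Random(seed)
    data = []
    # Offline data: the expert knows the secret, so every logged episode succeeds.
    for _ in range(n_expert):
        data.append((random_bits(rng), 1))
    before = predicted_open_rate(data)
    # Online data: the agent imitates the marginal of expert guesses (uniform)
    # without knowing the secret, and the environment supplies the true outcome.
    for _ in range(n_self):
        guess = random_bits(rng)
        data.append((guess, play(guess, rng)))
    after = predicted_open_rate(data)
    return before, after


if __name__ == "__main__":
    print(run())
```

With two random bits the true success rate of a blind guess is 1/4; the estimate after fine-tuning sits slightly above that because the expert episodes remain in the dataset.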

Theorems & Definitions (12)

  • Example 1: Stock trader
  • Example 2: Three cards game
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Theorem 3.6
  • Corollary 3.7
  • Definition 3.8
  • ...and 2 more