AI in a vat: Fundamental limits of efficient world modelling for agent sandboxing and interpretability

Fernando Rosas, Alexander Boyd, Manuel Baltieri

TL;DR

The paper reframes world models for AI agent evaluation as interfaces that map action sequences to outcomes, arguing for an 'AI in a vat' perspective in which multiple world models that the agent cannot tell apart are treated as equally valid representations. It introduces transducers as a unified formalism to generate these interfaces and systematically derives minimal world model constructions, notably via bisimulation and generalised quasi-probabilistic transducers, highlighting an intrinsic efficiency–interpretability trade-off. A key result is that the $\epsilon$-transducer provides the unique, minimal predictive model for real-time agents, while generalised transducers achieve maximal compression at the cost of interpretability. The work further proposes forward interpretability through epistemic (belief) world models and backward interpretability via retrodictive beliefs, including reversible transducers and the BDMSM, offering practical guidelines for designing world models tailored to specific safety and verification goals.

Abstract

Recent work proposes using world models to generate controlled virtual environments in which AI agents can be tested before deployment to ensure their reliability and safety. However, accurate world models often have high computational demands that can severely restrict the scope and depth of such assessments. Inspired by the classic 'brain in a vat' thought experiment, here we investigate ways of simplifying world models that remain agnostic to the AI agent under evaluation. By following principles from computational mechanics, our approach reveals a fundamental trade-off in world model construction between efficiency and interpretability, demonstrating that no single world model can optimise all desirable characteristics. Building on this trade-off, we identify procedures to build world models that either minimise memory requirements, delineate the boundaries of what is learnable, or allow tracking causes of undesirable outcomes. In doing so, this work establishes fundamental limits in world modelling, leading to actionable guidelines that inform core design choices related to effective agent evaluation.

Paper Structure

This paper contains 40 sections, 20 theorems, 109 equations, 5 figures.

Key Result

Lemma 1

A process $S_t$ is a world model for an anticipation-free interface $\mathcal{I}(\bm Y|\bm A)$ if and only if

Figures (5)

  • Figure 1: Recommendations for building optimal world models, including implementations (boxes), transformations (arrows), and design criteria (ellipses).
  • Figure 2: Illustration of an interface (left) and a possible unravelling of it via a presentation with a world model built from the memory states of a transducer (right), as given by \ref{eq:transducer}.
  • Figure 3: Illustration of the minimisation of world models. Purple boxes represent reducible models and orange boxes represent minimal ones, and arrows correspond to reductions. Red boxes are generalised models following quasi-probabilities, which (if allowed) establish global minima.
  • Figure 4: Three examples of reversible transducers. Circles represent world states, and arrows represent transitions and their labels describe the associated actions and outputs. For instance, the label $1|0\!\!:\!\!0.5$ on the edge from $s_0$ to $s_1$ indicates that $\Pr(S_{t+1}=s_1,Y_t=1|A_t=0,S_t=s_0)=0.5$.
  • Figure 5: Illustration of different types of transducers: Mealy transducers (a), output-Moore transducers (b), input-Moore transducers (c), and I-O Moore transducers (d).
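
The edge-label convention in the Figure 4 caption can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the paper's construction: the two-state kernel `KERNEL` and the helper `interface_probability` are hypothetical, and only the entry for the edge from $s_0$ to $s_1$ (label $1|0:0.5$) is taken from the caption. It computes $\Pr(\bm Y|\bm A)$ by summing over world states, which is one way to read "a transducer generates an interface".

```python
from collections import defaultdict

# Hypothetical two-state transducer; NOT one of the paper's Figure 4 examples.
# KERNEL[(s, a)] lists (next_state, output, probability) triples, i.e.
# Pr(S_{t+1}=s', Y_t=y | A_t=a, S_t=s), matching an edge label "y|a:p".
KERNEL = {
    ("s0", 0): [("s1", 1, 0.5), ("s0", 0, 0.5)],  # first triple: label 1|0:0.5
    ("s0", 1): [("s0", 0, 1.0)],
    ("s1", 0): [("s0", 1, 1.0)],
    ("s1", 1): [("s1", 0, 0.5), ("s0", 1, 0.5)],
}


def interface_probability(kernel, initial, actions, outputs):
    """Return Pr(outputs | actions), marginalising over world states.

    `initial` maps each state to its starting probability. The loop is a
    standard forward pass: the weights hold the joint probability of the
    outputs seen so far and the current state.
    """
    weights = dict(initial)
    for action, output in zip(actions, outputs):
        next_weights = defaultdict(float)
        for state, weight in weights.items():
            for next_state, emitted, prob in kernel[(state, action)]:
                if emitted == output:
                    next_weights[next_state] += weight * prob
        weights = next_weights
    return sum(weights.values())


if __name__ == "__main__":
    # Probability of observing outputs (1, 1) after actions (0, 0) from s0.
    print(interface_probability(KERNEL, {"s0": 1.0}, [0, 0], [1, 1]))
```

Because the result depends only on actions and outputs, any other kernel that returns the same values for every action and output sequence would be indistinguishable from this one to the agent under evaluation.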

Theorems & Definitions (40)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Definition 3
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Definition 4
  • Lemma 5
  • Definition 5
  • ...and 30 more