What Do World Models Learn in RL? Probing Latent Representations in Learned Environment Simulators

Xinyu Zhang

Abstract

World models learn to simulate environment dynamics from experience, enabling sample-efficient reinforcement learning. But what do these models actually represent internally? We apply interpretability techniques (linear and nonlinear probing, causal interventions, and attention analysis) to two architecturally distinct world models, IRIS (a discrete-token transformer) and DIAMOND (a continuous diffusion UNet), trained on Atari Breakout and Pong. Linear probes show that both models develop linearly decodable representations of game-state variables (object positions, scores); MLP probes yield only marginally higher $R^2$, indicating that these representations are approximately linear. Causal interventions that shift hidden states along probe-derived directions produce correlated changes in model predictions, providing evidence that the representations are functionally used rather than merely correlated with game state. Analysis of IRIS attention heads reveals spatial specialization: specific heads attend preferentially to tokens overlapping with game objects. Token ablation experiments with multiple replacement baselines consistently identify object-containing tokens as disproportionately important. Together, these findings provide interpretability evidence that learned world models develop structured, approximately linear internal representations of environment state across two games and two architectures.
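
To make the probing setup concrete, the following is a minimal sketch of a cross-validated linear probe on synthetic data. It is not the paper's implementation: the ridge regularizer, the helper names (`ridge_fit`, `probe_r2_cv`), the hidden-state dimensions, and the synthetic "ball position" target are all assumptions made for illustration. Only the use of linear probes scored by $R^2$ over 5-fold CV comes from the text.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept (assumed probe type)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def probe_r2_cv(H, y, n_folds=5, alpha=1.0, seed=0):
    """Mean and std of held-out R^2 for a linear probe from hidden states H to target y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w, b = ridge_fit(H[train], y[train], alpha)
        scores.append(r2_score(y[test], H[test] @ w + b))
    return float(np.mean(scores)), float(np.std(scores))

# Synthetic stand-in for one layer's hidden states (hypothetical shapes).
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 32))
direction = rng.normal(size=32)
ball_x = H @ direction + 0.1 * rng.normal(size=500)  # linearly encoded target
noise_target = rng.normal(size=500)                  # target unrelated to H

mean_r2, std_r2 = probe_r2_cv(H, ball_x)
null_r2, _ = probe_r2_cv(H, noise_target)
print(f"encoded target:   R^2 = {mean_r2:.3f} +/- {std_r2:.3f}")
print(f"unrelated target: R^2 = {null_r2:.3f}")
```

In this toy setting the linearly encoded target is recovered with high held-out $R^2$, while the unrelated target scores near or below zero. Comparing the same score for a nonlinear probe would be the analogue of the linear-versus-MLP comparison described above.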

Paper Structure (16 sections, 3 figures, 1 table)

Introduction
Method
Models and Ground Truth
Probing Protocol
Causal Intervention Protocol
Attention Analysis and Token Ablation
Results
Linear Representations Across Games
IRIS.
DIAMOND.
Cross-game consistency.
Causal Interventions Confirm Functional Use
Attention and Token Ablation
Discussion and Conclusion
Architectural comparison.
...and 1 more sections

Figures (3)

Figure 1: Probe $R^2$ across layers (in network data-flow order) for IRIS (left) and DIAMOND (right) on Breakout (top) and Pong (bottom). Each line tracks one game-state property; shaded bands show $\pm$1 std over 5-fold CV. IRIS representations are flat across transformer layers, while DIAMOND shows an inverted-V profile that peaks at the UNet bottleneck. The $y$-axis includes negative $R^2$ values, which show that DIAMOND's early encoder layers perform worse than a constant predictor for ball position.
Figure 2: Causal intervention on Breakout: shifting IRIS layer-5 hidden states along probe directions produces correlated changes in predictions ($r \geq 0.96$ for all properties, measured via KL divergence).
Figure 3: Three-way token ablation on Breakout ($4\times4$ grid). Zero, mean, and random replacement produce consistent importance rankings ($\rho > 0.92$), with token 0 (score/brick region) most critical.
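
The three-way ablation comparison in Figure 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: `toy_world_model` is a hypothetical linear stand-in for the world model, the per-token weights and random "frames" are synthetic, and the L2 distance between clean and ablated outputs is an assumed importance metric. Only the $4\times4$ token grid, the zero/mean/random baselines, and the rank-correlation comparison come from the caption.

```python
import numpy as np

N_TOKENS = 16  # 4x4 token grid, as in the Figure 3 caption

def toy_world_model(frames, token_weights):
    """Hypothetical stand-in model: weighted sum of token embeddings per frame."""
    # frames: (n_frames, N_TOKENS, d) -> output: (n_frames, d)
    return np.einsum("t,ntd->nd", token_weights, frames)

def ablation_importance(frames, token_weights, baseline, rng):
    """Mean L2 change in model output when each token is replaced in turn."""
    clean = toy_world_model(frames, token_weights)
    importance = np.zeros(N_TOKENS)
    for t in range(N_TOKENS):
        ablated = frames.copy()
        if baseline == "zero":
            ablated[:, t] = 0.0
        elif baseline == "mean":
            ablated[:, t] = frames[:, t].mean(axis=0)
        elif baseline == "random":
            # Swap in the same token position from a randomly chosen frame.
            ablated[:, t] = frames[rng.permutation(len(frames)), t]
        else:
            raise ValueError(f"unknown baseline: {baseline}")
        delta = toy_world_model(ablated, token_weights) - clean
        importance[t] = np.linalg.norm(delta, axis=1).mean()
    return importance

def spearman_rho(a, b):
    """Spearman rank correlation (assumes no ties, which holds for continuous scores)."""
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, N_TOKENS, 8))   # synthetic token embeddings
token_weights = 0.8 ** np.arange(N_TOKENS)     # toy assumption: token 0 matters most

scores = {b: ablation_importance(frames, token_weights, b, rng)
          for b in ("zero", "mean", "random")}
rho_zero_mean = spearman_rho(scores["zero"], scores["mean"])
rho_zero_random = spearman_rho(scores["zero"], scores["random"])
print("zero vs mean rho:  ", round(rho_zero_mean, 3))
print("zero vs random rho:", round(rho_zero_random, 3))
```

Agreement across baselines matters because any single replacement value can push inputs off-distribution in its own way; a ranking that survives all three is less likely to be an artifact of the baseline choice.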
