Out of Sight, Out of Mind? Evaluating State Evolution in Video World Models

Ziqi Ma, Mengzhan Liufu, Georgia Gkioxari

Abstract

Evolutions in the world, such as water pouring or ice melting, happen whether or not they are observed. Video world models generate "worlds" via 2D frame observations. Can these generated "worlds" evolve independently of observation? To probe this question, we design a benchmark that evaluates whether video world models can decouple state evolution from observation. Our benchmark, STEVO-Bench, applies observation control to evolving processes through instructions that insert an occluder, turn off the light, or specify camera "lookaway" trajectories. By evaluating video models with and without camera control on a diverse set of naturally occurring evolutions, we expose their limitations in decoupling state evolution from observation. STEVO-Bench proposes an evaluation protocol that automatically detects and disentangles failure modes of video world models across key aspects of natural state evolution. Analysis of STEVO-Bench results provides new insight into potential data and architecture biases of present-day video world models. Project website: https://glab-caltech.github.io/STEVOBench/. Blog: https://ziqi-ma.github.io/blog/2026/outofsight/

Paper Structure (24 sections, 14 figures, 5 tables)

Figures (14)

  • Figure 1: Today’s video world models “simulate” the world by generating pixels. We test whether they can separate state evolution from what’s visible by turning the camera away or adding in-scene occlusions. StEvo-Bench evaluates three key capabilities under lookaway/occlusion: whether evolution continues at all, whether it remains physically plausible, and whether the scene stays coherent.
  • Figure 2: StEvo-Bench probes whether video-based world models can decouple state evolution from observation. StEvo-Bench, consisting of 225 unique tasks spanning 6 categories, uses (image, text prompt, camera control) tuples to prompt video-based world models to generate evolutions under interrupted observation. Observation interruption is done either using text prompt (such as adding occluders, turning off the light), or using camera control (turning the camera view away). The generated videos go through StEvo-Bench's automatic verifiers for evaluation.
  • Figure 3: Automatic verifier pipeline for StEvo-Bench. Five video understanding verifiers independently assess each generated video, each on a single criterion. Observation Control and Action Control jointly determine Control Success; State Progress, Physics Plausibility, and Coherence jointly determine Evolution Success.
  • Figure 4: Representative failure modes of video models when observation of the evolution process is interrupted. The model is capable of simulating the process correctly when the scene is fully visible. However, when observation of the process is temporarily interrupted, the model either stops evolving state, as seen in the mattress deflation example, or fails to preserve object coherence, as seen in the sponge example. These videos are generated by Veo 3.
  • Figure 5: Open-source camera-controlled video models assume the scene to be static as the camera turns, and fail to correctly evolve state, such as the ball dropping, wave advancing, or tablet dropping.
  • ...and 9 more figures
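
The Figure 3 caption describes how five per-video verifier verdicts combine into two outcomes. The sketch below illustrates one way that aggregation might look. It assumes each verifier returns a boolean and that "jointly determine" means a logical AND; neither detail is stated in the caption. The names `VerifierResults`, `control_success`, and `evolution_success` are hypothetical and not taken from the benchmark's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierResults:
    """Hypothetical container for the five verdicts named in Figure 3."""
    observation_control: bool   # was observation interrupted as instructed?
    action_control: bool        # was the prompted process actually initiated?
    state_progress: bool        # did the state keep evolving while unobserved?
    physics_plausibility: bool  # is the evolved state physically plausible?
    coherence: bool             # does the scene stay coherent?


def control_success(r: VerifierResults) -> bool:
    # Assumption: both control verdicts must pass (logical AND).
    return r.observation_control and r.action_control


def evolution_success(r: VerifierResults) -> bool:
    # Assumption: all three evolution verdicts must pass (logical AND).
    return r.state_progress and r.physics_plausibility and r.coherence


# Illustrative verdicts only, loosely modeled on the failure mode in the
# Figure 4 caption where the state stops evolving during the interruption.
stalled = VerifierResults(
    observation_control=True,
    action_control=True,
    state_progress=False,
    physics_plausibility=True,
    coherence=True,
)
```

Under these assumptions, `control_success(stalled)` is true while `evolution_success(stalled)` is false. Keeping the two outcomes separate distinguishes a model that failed to follow the observation instruction from one that followed it but did not evolve the state.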