Interpreting Physics in Video World Models
Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, Mike Rabbat
TL;DR
The paper addresses whether video world models encode physical information via factorized latent variables or distributed, task-specific representations. It develops a targeted interpretability framework using layerwise probes, subspace analysis, patch-level decoding, and attention ablations across two encoder-based video transformers to locate where physics information becomes accessible. The key findings reveal a Physics Emergence Zone at roughly one-third depth where physical variables become readable, with speed and acceleration appearing early and direction materializing in the zone as a high-dimensional circular code; motion-direction and possible-vs-impossible judgments occupy nearly orthogonal latent subspaces, while local spatiotemporal attention within the zone is causally critical. Together, these results argue against compact, reusable latent physics engines in favor of distributed, task-specific representations, with implications for cognitive science debates, neuroscience parallels, and future physics simulators.
Abstract
A long-standing question in physical reasoning is whether video-based models must rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition, which we call the Physics Emergence Zone, at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.
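To make the idea of layerwise probing for a circular direction code concrete, here is a minimal sketch under stated assumptions; it is not the authors' code. It uses synthetic toy activations rather than real encoder features, and it swaps in a ridge-regression probe as one plausible probe choice. The names `ridge_probe_r2`, `make_toy_activations`, and `emergence_layer` are hypothetical, introduced only for illustration. Direction is probed through the targets (cos θ, sin θ), which is one common way to handle a circular variable with a linear readout.

```python
import numpy as np


def ridge_probe_r2(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit a ridge-regression probe; return held-out R^2 averaged over targets."""
    x_mean = X_train.mean(axis=0)
    y_mean = Y_train.mean(axis=0)
    Xc = X_train - x_mean
    Yc = Y_train - y_mean
    dim = Xc.shape[1]
    # Closed-form ridge solution: (X^T X + alpha I)^{-1} X^T Y
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(dim), Xc.T @ Yc)
    pred = (X_test - x_mean) @ W + y_mean
    ss_res = ((Y_test - pred) ** 2).sum(axis=0)
    ss_tot = ((Y_test - Y_test.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


def make_toy_activations(n=600, dim=32, n_layers=6, emergence_layer=2, seed=0):
    """Synthetic stand-in for per-layer features (assumption, not real model data).

    Layers before `emergence_layer` are pure noise; later layers carry a
    distributed linear embedding of (cos theta, sin theta) plus noise.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    circ = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    layers = []
    for layer in range(n_layers):
        noise = rng.normal(size=(n, dim))
        if layer >= emergence_layer:
            mix = rng.normal(size=(2, dim))  # spreads direction over all dims
            layers.append(circ @ mix + 0.5 * noise)
        else:
            layers.append(noise)
    return layers, circ


layers, targets = make_toy_activations()
split = 400  # first 400 samples train the probe, the rest evaluate it
scores = [
    ridge_probe_r2(acts[:split], targets[:split], acts[split:], targets[split:])
    for acts in layers
]
for depth, score in enumerate(scores):
    print(f"layer {depth}: held-out R^2 for direction = {score:.3f}")
```

In this toy setup the probe score stays near zero before the hypothetical `emergence_layer` and jumps afterwards, which mirrors the qualitative shape of a depth-wise transition. With real models, the per-layer activations would come from the encoder's hidden states, and the probe, pooling, and train/test split would need to follow the paper's own protocol.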
