Interpreting Physics in Video World Models

Sonia Joseph, Quentin Garrido, Randall Balestriero, Matthew Kowal, Thomas Fel, Shahab Bakhtiari, Blake Richards, Mike Rabbat

TL;DR

The paper addresses whether video world models encode physical information via factorized latent variables or distributed, task-specific representations. It develops a targeted interpretability framework using layerwise probes, subspace analysis, patch-level decoding, and attention ablations across two encoder-based video transformers to locate where physics information becomes accessible. The key findings reveal a Physics Emergence Zone at roughly one-third depth where physical variables become readable, with speed and acceleration appearing early and direction materializing in the zone as a high-dimensional circular code; motion-direction and possible-vs-impossible judgments occupy nearly orthogonal latent subspaces, while local spatiotemporal attention within the zone is causally critical. Together, these results argue against compact, reusable latent physics engines in favor of distributed, task-specific representations, with implications for cognitive science debates, neuroscience parallels, and future physics simulators.
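
One common way to quantify how close two latent subspaces are to orthogonal is to compute the principal angles between them. The sketch below is illustrative only and is not taken from the paper: the function name `principal_angles` and the assumption that each subspace is given as a matrix of linearly independent basis columns (for example, stacked probe weight vectors) are ours. Angles near $\pi/2$ indicate nearly orthogonal subspaces; angles near 0 indicate shared directions.

```python
import numpy as np

def principal_angles(basis_a, basis_b):
    """Principal angles (radians, ascending) between two column spaces.

    basis_a, basis_b: arrays of shape (d, k_a) and (d, k_b) whose columns
    span each subspace (hypothetical inputs, e.g. stacked probe weights);
    columns are assumed linearly independent.
    """
    # Orthonormalize each basis so singular values equal cosines of the angles.
    q_a, _ = np.linalg.qr(basis_a)
    q_b, _ = np.linalg.qr(basis_b)
    cosines = np.linalg.svd(q_a.T @ q_b, compute_uv=False)
    # Clip to guard against round-off slightly outside [-1, 1].
    return np.arccos(np.clip(cosines, -1.0, 1.0))
```

For instance, calling `principal_angles` on two disjoint sets of coordinate axes returns angles of $\pi/2$, while calling it on two different bases of the same subspace returns angles of approximately 0.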

Abstract

A long-standing question in physical reasoning is whether video-based models need to rely on factorized representations of physical variables in order to make physically accurate predictions, or whether they can implicitly represent such variables in a task-specific, distributed manner. While modern video world models achieve strong performance on intuitive physics benchmarks, it remains unclear which of these representational regimes they implement internally. Here, we present the first interpretability study to directly examine physical representations inside large-scale video encoders. Using layerwise probing, subspace geometry, patch-level decoding, and targeted attention ablations, we characterize where physical information becomes accessible and how it is organized within encoder-based video transformers. Across architectures, we identify a sharp intermediate-depth transition -- which we call the Physics Emergence Zone -- at which physical variables become accessible. Physics-related representations peak shortly after this transition and degrade toward the output layers. Decomposing motion into explicit variables, we find that scalar quantities such as speed and acceleration are available from early layers onwards, whereas motion direction becomes accessible only at the Physics Emergence Zone. Notably, we find that direction is encoded through a high-dimensional population structure with circular geometry, requiring coordinated multi-feature intervention to control. These findings suggest that modern video models do not use factorized representations of physical variables like a classical physics engine. Instead, they use a distributed representation that is nonetheless sufficient for making physical predictions.
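
To make the idea of layerwise probing concrete, the following is a minimal sketch on synthetic data, not the authors' pipeline. It assumes one pooled feature vector per clip per layer and uses a closed-form ridge classifier as the linear probe; the helper names (`fit_ridge_probe`, `layerwise_probe`, `make_toy_features`), the regularization strength, and the toy data, in which a label signal is injected only from a chosen `onset_layer`, are all illustrative assumptions.

```python
import numpy as np

def fit_ridge_probe(X, y, l2=1e-2):
    """Fit a linear probe by ridge regression onto +/-1 targets (bias included)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    targets = 2.0 * y - 1.0
    gram = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(gram, Xb.T @ targets)

def probe_accuracy(w, X, y):
    """Classification accuracy of a fitted probe on held-out features."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(((Xb @ w > 0).astype(int) == y).mean())

def layerwise_probe(features_by_layer, labels, train_frac=0.8, seed=0):
    """Train one probe per layer on a shared split; return held-out accuracies."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train, test = order[:n_train], order[n_train:]
    accuracies = []
    for X in features_by_layer:
        w = fit_ridge_probe(X[train], labels[train])
        accuracies.append(probe_accuracy(w, X[test], labels[test]))
    return accuracies

def make_toy_features(n_clips=400, n_layers=12, dim=32, onset_layer=4, seed=0):
    """Synthetic stand-in for encoder activations (assumption, not real data):
    the binary label becomes linearly readable only from onset_layer onward."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_clips)
    axis = rng.normal(size=dim)
    axis /= np.linalg.norm(axis)
    features = []
    for layer in range(n_layers):
        X = rng.normal(size=(n_clips, dim))
        if layer >= onset_layer:
            X = X + 3.0 * np.outer(2.0 * labels - 1.0, axis)
        features.append(X)
    return features, labels
```

Plotting the returned accuracies against layer index on this toy data gives a step from chance to near-ceiling at `onset_layer`, which is the kind of sharp transition a layerwise probe is designed to expose.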

Paper Structure (64 sections, 8 equations, 24 figures, 4 tables)

Figures (24)

  • Figure 1: The Physics Emergence Zone consistently emerges about one-third of the way through the model's layers. We report probe performance on V-JEPA 2 (Large, Huge, and Giant) and VideoMAEv2-G across all layers for the possible-vs-impossible physical reasoning task. The shaded area marks the emergence zone, roughly one-third of the way through the network, where linear probes (left) and attentive-MLP probes (right) begin to perform well on the task. For full results, including the VideoMAEv2 family, see Appendix \ref{app:additional_experiments:possible-vs-impossible}.
  • Figure 2: Motion property encoding across layers. (a) Example from our synthetic ball rolling dataset showing motion direction $\theta$. (b) Cartesian representations ($v_x$, $a_x$) show similar layer-wise emergence patterns, with acceleration available at the same time as velocity. (c) Polar representations (speed, direction, acceleration magnitude) across layer fraction. Direction is only available at the Physics Emergence Zone, while magnitudes are available early.
  • Figure 3: Spatiotemporally local attention heads appear only at the Physics Emergence Zone. Per-head attention locality heatmap showing the coexistence of local and long-range heads at this stage. See Appendix Fig. \ref{fig:attention_locality} for a line plot of attention distance.
  • Figure 4: Direction neurons form a ring-shaped population code with structured redundancy. (a) At the one-third emergence zone, direction-selective MLP units tile the full angular space and organize into a circular population code. (b) Individual neurons exhibit smooth, sinusoidal tuning to motion direction. (c) Probe accuracy across successive orthogonalizations exhibits a sawtooth pattern, indicating structured redundancy consistent with paired (e.g., sine–cosine) feature encodings.
  • Figure 5: A possible and impossible example from each fold of IntPhys.
  • ...and 19 more figures
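
The Cartesian and polar motion variables in Figure 2, and the paired sine–cosine encoding mentioned in Figure 4, can be illustrated with a short sketch. This is our own illustration under simple assumptions, not code from the paper: velocity is taken as a 2D vector $(v_x, v_y)$, direction $\theta$ is measured in radians in $[0, 2\pi)$, and all function names are hypothetical. Regressing onto $(\sin\theta, \cos\theta)$ rather than $\theta$ itself avoids the discontinuity at the $0 / 2\pi$ wrap-around.

```python
import numpy as np

def to_polar(vx, vy):
    """Convert Cartesian velocity to (speed, direction in [0, 2*pi))."""
    speed = np.hypot(vx, vy)
    theta = np.arctan2(vy, vx) % (2 * np.pi)
    return speed, theta

def direction_targets(theta):
    """Encode angles as (sin, cos) pairs, a continuous target for regression probes."""
    theta = np.asarray(theta)
    return np.stack([np.sin(theta), np.cos(theta)], axis=-1)

def decode_direction(sin_cos):
    """Recover angles in [0, 2*pi) from (sin, cos) predictions."""
    sin_cos = np.asarray(sin_cos)
    return np.arctan2(sin_cos[..., 0], sin_cos[..., 1]) % (2 * np.pi)

def circular_error(theta_a, theta_b):
    """Absolute angular difference, wrapped to [0, pi]."""
    diff = (np.asarray(theta_a) - np.asarray(theta_b) + np.pi) % (2 * np.pi) - np.pi
    return np.abs(diff)
```

Evaluating a direction probe with `circular_error` instead of a plain absolute difference ensures that predictions of 0.1 and $2\pi - 0.1$ radians count as 0.2 radians apart rather than nearly a full turn.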