Chain of Time: In-Context Physical Simulation with Image Generation Models

YingQiao Wang, Eric Bigelow, Boyi Li, Tomer Ullman

TL;DR

The paper introduces Chain-of-Time, a cognitively inspired method that augments image generation models with stepwise, in-context physical simulation by generating intermediate future frames without additional training. Framed by human mental simulation, the approach decomposes prediction into de-rendering, simulation, and rendering steps and uses two prompts to unfold time in increments. Empirical results across 2D and 3D domains (motion, gravity, fluids, collisions) show improved accuracy in several settings, while revealing parameter-estimation weaknesses in 3D physics. The work advances interpretability and diagnostic capabilities for physical reasoning in vision-language models, with practical implications for tasks requiring temporal coherence and physical understanding.

Abstract

We propose a novel cognitively-inspired method to improve and interpret physical simulation in vision-language models. Our "Chain of Time" method involves generating a series of intermediate images during a simulation, and it is motivated by in-context reasoning in machine learning, as well as mental simulation in humans. Chain of Time is used at inference time, and requires no additional fine-tuning. We apply the Chain-of-Time method to synthetic and real-world domains, including 2-D graphics simulations and natural 3-D videos. These domains test a variety of particular physical properties, including velocity, acceleration, fluid dynamics, and conservation of momentum. We found that using Chain-of-Time simulation substantially improves the performance of a state-of-the-art image generation model. Beyond examining performance, we also analyzed the specific states of the world simulated by an image model at each time step, which sheds light on the dynamics underlying these simulations. This analysis reveals insights that are hidden from traditional evaluations of physical reasoning, including cases where an image generation model is able to simulate physical properties that unfold over time, such as velocity, gravity, and collisions. Our analysis also highlights particular cases where the image generation model struggles to infer particular physical parameters from input images, despite being capable of simulating relevant physical processes.

Paper Structure

This paper contains 29 sections, 1 equation, 21 figures.

Figures (21)

  • Figure 1: (Left, Top) We study physical reasoning in multi-modal image generation models by providing the model a sequence of input images showing a scene in subsequent time steps, and having the model generate an image that simulates what the scene will look like some time in the future. Accurately predicting future world states requires reasoning about physical properties. (Left, Bottom) Our method, Chain of Time, allows these models to simulate a sequence of images in-context, generating one image at a time, with the last image representing the final prediction of the scene. (Right) We use four experimental domains designed to test models' ability to reason about specific physical properties: Velocity, Gravity, Fluid Dynamics, and Collision.
  • Figure 2: In our paradigm, we give an IGM a sequence of input images, along with a prompt instructing the model to simulate the scene into the future for a specified length of time (Left). As a baseline, Direct Prediction (Middle, Top) directly predicts the final state (Right) without intermediate steps. We propose a novel method, Chain of Time (Middle, Bottom), which instead generates a sequence of images corresponding to a step-by-step simulation of the scene on the way to the predicted final state, with each intermediate image serving as both the output of one step and an input to the next.
  • Figure 3: Chain of Time is a composition of three components: De-rendering $\phi$, Simulation $\tau$, and Rendering $\phi^{-1}$. De-rendering operates by converting input images $I_0 \ldots I_t$ into world states $X_t$, which represent a physical simulation over time. Chain of Time begins with an initial prompt $L_{\text{init}}$ and iteratively generates a sequence of in-context output images $I_{t+1} \ldots I_{T}$ with follow-up prompts $L_{t+1} \ldots L_T$.
  • Figure 4: Prediction errors for all four domains, averaged across all data for each domain. Prediction error is measured by taking the average RMSE between the ground-truth positions (location of focal object, or water level) and the positions predicted by the IGM. Error bars are 95% CI. We generally find a monotonic relationship between Chain-of-Time precision and performance. In the case of Fluids, we observe that the initial state simulated by the IGM is inaccurate, and this error compounds with increasing degrees of simulation; see Section \ref{sec:vlm-physparam} for detailed analysis.
  • Figure 5: (Left) Prediction error rate across methods and time periods. In the collision domain, we find lower error rates in image model predictions for periods before and after the bouncing collision, compared with time periods during which the collision occurs. This disparity increases with Chain of Time, since performance improves for the before/after periods, but error remains high for the collision time period. (Right) Simulated ball location (orange) using Chain-of-Time 0.2s in the Bouncing domain follows a similar U-shaped curve as the ground truth ball location (red). Ball locations are shown here for a single video (orange), with predictions aggregated across all samples for the three time periods (before/during/after collision).
  • ...and 16 more figures
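
The stepwise procedure described in the captions for Figures 2 and 3, and the error metric from Figure 4, can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything named here is an assumption made for illustration: a "frame" is reduced to the (x, y) position of the focal object, `toy_generator` is a hypothetical stand-in for a real image generation model (it simply extrapolates at constant velocity), the prompt wording is invented, and the 95% interval uses a normal approximation because the source does not say how its intervals are computed.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Assumption: a frame is just the (x, y) position of the focal object.
Frame = Tuple[float, float]
# Assumption: the model takes all frames in context plus a text prompt.
Generator = Callable[[List[Frame], str], Frame]


def chain_of_time(frames: Sequence[Frame], horizon: float, step: float,
                  generate: Generator) -> List[Frame]:
    """Roll the scene forward `horizon` seconds, one `step`-second frame at a time."""
    n_steps = max(1, round(horizon / step))
    context = list(frames)
    outputs: List[Frame] = []
    for k in range(n_steps):
        if k == 0:
            # Initial prompt (hypothetical wording).
            prompt = f"Show the scene {step:g}s after the last frame."
        else:
            # Follow-up prompt (hypothetical wording).
            prompt = f"Continue the simulation for another {step:g}s."
        frame = generate(context, prompt)
        context.append(frame)   # each output becomes input to the next step
        outputs.append(frame)
    return outputs              # the last element is the final prediction


def toy_generator(context: List[Frame], prompt: str) -> Frame:
    """Hypothetical model: constant-velocity extrapolation from the last two frames."""
    (x0, y0), (x1, y1) = context[-2], context[-1]
    return (2 * x1 - x0, 2 * y1 - y0)


def rmse(predicted: Sequence[Frame], truth: Sequence[Frame]) -> float:
    """Root-mean-square Euclidean distance between predicted and true positions."""
    squared = [(px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


def mean_ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and half-width of a normal-approximation 95% interval (an assumption)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, 1.96 * math.sqrt(variance / n)


if __name__ == "__main__":
    observed = [(0.0, 0.0), (1.0, 0.5)]            # two input frames
    rollout = chain_of_time(observed, horizon=0.6, step=0.2,
                            generate=toy_generator)
    truth = [(2.0, 1.0), (3.0, 1.5), (4.0, 2.5)]   # made-up ground truth
    print(rollout, rmse(rollout, truth))
```

In this sketch, a Direct Prediction baseline would correspond to a single `generate` call with a prompt covering the whole horizon. Keeping every generated frame in `outputs` is what makes the per-step inspection described in Figure 5 possible.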