Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Wang Lin, Liyu Jia, Wentao Hu, Kaihang Pan, Zhongqi Yue, Wei Zhao, Jingyuan Chen, Fei Wu, Hanwang Zhang

TL;DR

This work tackles the problem of generating videos that obey physical laws by introducing the Diffusion Timestep Tokenizer (DDT) and the Phys-AR framework, which combines symbolic reasoning with reinforcement learning. A two-stage process first transfers symbolic knowledge to a language model, then uses GRPO-based RL with velocity and mass rewards to derive physical laws during video generation. Experiments on PhyWorld across uniform, parabolic, and collision motions show that autoregressive generation with DDT tokens generalizes better to unseen physical conditions than diffusion-based or spatial-token baselines. The results demonstrate that integrating symbolic tokens with physics-oriented RL yields physically consistent video generation and improved out-of-distribution robustness, with potential for scaling to larger, more complex world models.

Abstract

Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) due to their reliance on data-driven approximations. To address this, we propose to integrate symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. These recursive visual tokens enable symbolic reasoning by a large language model. Building on them, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, ensuring adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
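
The second stage can be illustrated with a minimal sketch of a group-relative advantage, the quantity at the core of GRPO-style training. This is not the authors' implementation: the reward shape in `velocity_reward`, the use of the population standard deviation, and all names and numbers below are assumptions chosen for illustration. The idea is that several videos are sampled for the same conditioning frames, each is scored by how well its measured velocity matches the target, and each score is standardized within its group.

```python
from statistics import mean, pstdev


def velocity_reward(pred_velocity: float, true_velocity: float) -> float:
    """Hypothetical reward: 1.0 at zero velocity error, decaying toward 0."""
    return 1.0 / (1.0 + abs(pred_velocity - true_velocity))


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled continuations for one prompt (illustrative velocities only).
true_v = 2.0
sampled_v = [2.0, 2.5, 1.0, 3.0]
rewards = [velocity_reward(v, true_v) for v in sampled_v]
advantages = group_relative_advantages(rewards)
```

Samples whose velocity is closer to the target receive a positive advantage and the others a negative one, so the policy update needs no separate value network.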

Paper Structure

This paper contains 29 sections, 10 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Illustration of autoregressive (AR) physical video generation. Given the first 3 frames as conditions: (a) the AR model based on spatial tokens generates a ball that deviates from the correct trajectory in the predicted video, indicating that it has not reasoned correctly about the physical laws; (b) the AR model based on our diffusion timestep tokens correctly generates a video that conforms to the physical laws.
  • Figure 2: Overview of our method. (a) The diffusion timestep tokenizer encodes an image into a recursive sequence of discrete tokens. (b) An autoregressive architecture learns the new image tokens via next-token prediction. (c) Reinforcement learning infers physical laws through reward signals based on physical variables.
  • Figure 3: Comparison of ball velocity errors across methods on in-distribution and out-of-distribution videos, given the first three frames as input. The prediction error of the DDT-based AR model on out-of-distribution data is on the order of $10^{-2}$, while the errors of the DiT model and the spatial-token AR model are on the order of $10^{-1}$.
  • Figure 4: Comparison of generated videos for the 3 physical motions. The dotted line indicates the correct motion trajectory, and the arrow indicates the progression of time. For parabolic motion, the predicted video frames are superimposed for ease of comparison.
  • Figure 5: Comparison of how different token embeddings respond to image changes. The response of DDT tokens is essentially the same as that of text tokens, while VQGAN tokens do not behave like text tokens due to the symmetry of 2D image space.
  • ...and 7 more figures
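
The velocity error reported in Figure 3 can be made concrete with a small sketch. It assumes the ball's horizontal position has already been extracted from each frame; the function names, the frame interval `dt`, and the positions are hypothetical and are not taken from the paper's evaluation code.

```python
def mean_velocity(xs, dt):
    """Average finite-difference velocity over consecutive frames."""
    if len(xs) < 2:
        raise ValueError("need at least two positions")
    steps = [(b - a) / dt for a, b in zip(xs, xs[1:])]
    return sum(steps) / len(steps)


def velocity_error(pred_xs, true_xs, dt):
    """Absolute gap between predicted and reference mean velocity."""
    return abs(mean_velocity(pred_xs, dt) - mean_velocity(true_xs, dt))


# Illustrative positions only: the reference ball moves at about 2.0 units/s,
# the predicted ball at about 2.5 units/s.
dt = 0.1
true_xs = [0.0, 0.2, 0.4, 0.6]
pred_xs = [0.0, 0.25, 0.5, 0.75]
err = velocity_error(pred_xs, true_xs, dt)
```

Under this definition, an error on the order of $10^{-2}$ means the generated ball's speed differs from the reference by a few hundredths of a unit per second.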