Reasoning Physical Video Generation with Diffusion Timestep Tokens via Reinforcement Learning

Wang Lin; Liyu Jia; Wentao Hu; Kaihang Pan; Zhongqi Yue; Wei Zhao; Jingyuan Chen; Fei Wu; Hanwang Zhang

TL;DR

This work tackles the problem of generating videos that obey physical laws by introducing the Diffusion Timestep Tokenizer (DDT) and the Phys-AR framework, which combines symbolic reasoning with reinforcement learning. A two-stage process first transfers symbolic knowledge to a language model, then uses GRPO-based RL with velocity and mass rewards to derive physical laws during video generation. Experiments on PhyWorld across uniform, parabolic, and collision motions show that autoregressive generation with DDT tokens generalizes better to unseen physical conditions than diffusion-based or spatial-token baselines. The results demonstrate that integrating symbolic tokens with physics-oriented RL yields physically consistent video generation and improved out-of-distribution robustness, with potential for scaling to larger, more complex world models.

Abstract

Despite recent progress in video generation, producing videos that adhere to physical laws remains a significant challenge. Traditional diffusion-based methods struggle to extrapolate to unseen physical conditions (e.g., velocity) because they rely on data-driven approximations. To address this, we propose integrating symbolic reasoning and reinforcement learning to enforce physical consistency in video generation. We first introduce the Diffusion Timestep Tokenizer (DDT), which learns discrete, recursive visual tokens by recovering visual attributes lost during the diffusion process. These recursive visual tokens enable symbolic reasoning by a large language model. Building on DDT, we propose the Phys-AR framework, which consists of two stages: the first stage uses supervised fine-tuning to transfer symbolic knowledge, while the second stage applies reinforcement learning to optimize the model's reasoning abilities through reward functions based on physical conditions. Our approach allows the model to dynamically adjust and improve the physical properties of generated videos, encouraging adherence to physical laws. Experimental results demonstrate that Phys-AR can generate videos that are physically consistent.
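
To make the second stage more concrete, the following is a minimal Python sketch of the group-relative advantage computation used in GRPO-style training, paired with a toy velocity reward. The text above does not give the exact form of the reward functions, so `velocity_reward`, its exponential decay, the `scale` parameter, and the sample velocities are all illustrative assumptions, not the authors' implementation. The use of the population standard deviation is likewise an assumption.

```python
import math
from statistics import mean, pstdev


def velocity_reward(predicted_velocity, target_velocity, scale=1.0):
    """Hypothetical reward: 1.0 for an exact match, decaying with absolute error.

    The functional form is an assumption made for illustration only.
    """
    return math.exp(-abs(predicted_velocity - target_velocity) / scale)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group.

    GRPO scores a sample relative to the mean and standard deviation of the
    rewards obtained by the group of samples drawn for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts for one conditioning prompt. The velocities stand
# in for values estimated from generated videos (made-up numbers).
target_velocity = 2.0
predicted_velocities = [2.0, 1.5, 3.0, 2.1]

rewards = [velocity_reward(v, target_velocity) for v in predicted_velocities]
advantages = group_relative_advantages(rewards)

for v, r, a in zip(predicted_velocities, rewards, advantages):
    print(f"velocity={v:.2f} reward={r:.3f} advantage={a:+.3f}")
```

In this sketch, rollouts whose velocity is closer to the target receive a positive advantage and the rest receive a negative one, so the policy update favors the more physically accurate samples within each group without needing a separate value model.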
