Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation

Zijian Song; Qichang Li; Sihan Qin; Yuhao Chen; Tianshui Chen; Liang Lin; Guangrun Wang

TL;DR

PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks, demonstrates superior capability in physically complex tasks such as grasping transparent objects.

Abstract

The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream manipulation. To ensure efficient convergence, we incorporate causal masking, inverse kinematics, Lookahead Multi-Token Prediction (L-MTP), and key-value (KV) caching. Experimental results on the Libero and ManiSkill benchmarks demonstrate that PhysGen consistently outperforms robust baselines, surpassing OpenVLA and WorldVLA by margins of 13.8% and 8.8%, respectively. Notably, in real-world scenarios, PhysGen matches the performance of large-scale action-pretrained models like $\pi_0$ without requiring prior action-specific pretraining, demonstrating superior capability in physically complex tasks such as grasping transparent objects. These findings validate the potential of extracting physical intuition from pretrained video generators to facilitate generalizable robotic manipulation.

Paper Structure (21 sections, 13 equations, 6 figures, 4 tables)

Introduction
Related Work
Vision-Language-Action Models
Video-Action Joint Prediction
Non-Quantized Autoregressive Models
Preliminaries
Diffusion Models
Autoregression without Quantization
Video Autoregressive Model
Method
Overview
Model Architecture
Model Design Choices
Loss Function
Implementation Details
...and 6 more sections

Figures (6)

Figure 1: The PhysGen Framework: Repurposing Video Generation as a World Simulator. Our approach unifies perception and control through Physical Autoregression. The model operates on a sequence of continuous physical tokens (red), which fuse visual context (orange) and embodiment actions (black) to predict their joint evolution. Acting as a predictive world model, PhysGen runs in synchronization with the physical environment (blue): at each step, predicted tokens are decoded into executed actions and anticipated video frames. The resulting environmental feedback is re-encoded into the stream, enabling a seamless continuous loop of planning and interaction.
Figure 2: The model architecture of PhysGen. Starting from text tokens, PhysGen leverages a Causal Transformer to autoregressively predict physical tokens. Notably, we apply a frame diffusion and an action diffusion to estimate the conditional distributions of visual and action signals in continuous space. The structure of the action diffusion network is shown on the right column, where the predicted token serves as the conditional vector and is injected through cross-attention.
Figure 3: The causal attention mask. Frame tokens use chunk-wise full attention; action tokens use temporal causal attention; action tokens unidirectionally attend to frame tokens for visually conditioned control.
Figure 4: Video predictions and actual action executions. Each row shows PhysGen's predicted video alongside the corresponding execution video for five different tasks. The strong visual similarity between predicted and actual action videos highlights the effectiveness of our approach in transferring knowledge from video pretraining.
Figure 5: Visualization of real world manipulation. We deploy our method on a Franka Panda robot across four real-world tasks: Pick Transparency, Pick Cube, Press Button, and Stack Cube. The robot executes stable and coherent manipulations across diverse settings, demonstrating effective knowledge transfer from video generation to real-world manipulation.
...and 1 more figure
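
The attention pattern described in the Figure 3 caption can be illustrated with a small boolean mask. The following is only a sketch under stated assumptions, not the authors' implementation: the token layout (each step contributes its frame tokens followed by its action tokens), the function name build_mask, and the choice to let action tokens see frames and actions of the same and earlier steps are all hypothetical readings of the caption.

```python
def build_mask(num_steps, frame_tokens, action_tokens):
    """Return (mask, kinds); mask[q][k] is True when query q may attend to key k.

    Assumed layout (hypothetical): for every step, frame tokens come first,
    followed by that step's action tokens.
    """
    kinds = []  # one (modality, step) pair per token position
    for step in range(num_steps):
        kinds += [("frame", step)] * frame_tokens
        kinds += [("action", step)] * action_tokens

    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for q, (q_kind, q_step) in enumerate(kinds):
        for k, (k_kind, k_step) in enumerate(kinds):
            if k_step > q_step:
                continue  # no token attends to a future step
            if q_kind == "frame":
                # Chunk-wise full attention: a frame token sees every frame
                # token of its own and earlier steps, but no action tokens.
                mask[q][k] = k_kind == "frame"
            else:
                # Action tokens see frames (one-directional) and actions
                # of the same and earlier steps (temporal causality).
                mask[q][k] = True
    return mask, kinds


# Print a tiny example: 2 steps, 2 frame tokens and 1 action token per step.
mask, kinds = build_mask(num_steps=2, frame_tokens=2, action_tokens=1)
for (kind, step), row in zip(kinds, mask):
    print(f"{kind:6s} t={step}", "".join("1" if v else "0" for v in row))
```

Each printed row is one query token and each column one key token. In this reading, frame rows never contain a 1 in an action column, which is what makes the action-to-frame attention unidirectional.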
