Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling
Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu, Yuyang Yan, Xiaowei Chi, Yandong Guo, Boxin Shi, Shanghang Zhang
TL;DR
Video2Act targets robust robotic policy learning by explicitly extracting spatial and motion priors from the latent representations of a video diffusion model and feeding them to a fast diffusion-transformer action head. It introduces an asynchronous dual-system design: a slow, perceptual System 2 based on a video diffusion model provides structured conditioning, while a fast System 1 generates high-frequency actions conditioned on it through cross-attention. The approach uses Sobel-based spatial filtering and FFT-based motion features to produce discriminative spatio-motional cues. It reports state-of-the-art average success rates on RoboTwin in simulation and on six real-world tasks, surpassing previous VLA methods by 7.7% and 21.7% respectively, and generalizes to unseen conditions. Comprehensive ablations and qualitative analyses indicate that the spatio-motional conditioning improves both the stability and the diversity of manipulation strategies in dynamic settings.
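To make the two filtering ideas concrete, here is a minimal NumPy sketch of a Sobel gradient magnitude over spatial dimensions and an FFT high-pass along the time axis. It is an illustration only, not the paper's implementation: the tensor layout (T, C, H, W), the 3x3 kernels, the edge padding, the `cutoff` parameter, and the function names `sobel_magnitude` and `temporal_highpass` are all assumptions made for this example.

```python
import numpy as np


def sobel_magnitude(feat):
    """Sobel gradient magnitude of a (T, C, H, W) feature map.

    Assumed layout and 3x3 kernels; flat regions map to zero, edges to
    large values, which is the sense in which boundaries are emphasized.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    # Replicate border values so the output keeps the input's H x W size.
    padded = np.pad(feat, ((0, 0), (0, 0), (1, 1), (1, 1)), mode="edge")
    h, w = feat.shape[-2:]
    gx = np.zeros_like(feat, dtype=np.float64)
    gy = np.zeros_like(feat, dtype=np.float64)
    # Accumulate the 3x3 cross-correlation as nine shifted, weighted windows.
    for i in range(3):
        for j in range(3):
            window = padded[..., i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.sqrt(gx ** 2 + gy ** 2)


def temporal_highpass(feat, cutoff=1):
    """Remove the lowest `cutoff` temporal frequency bins via FFT over axis 0.

    With cutoff=1 only the DC bin is dropped, i.e. the per-pixel temporal
    mean is subtracted, so static content vanishes and inter-frame change remains.
    """
    spec = np.fft.rfft(feat, axis=0)
    spec[:cutoff] = 0
    return np.fft.irfft(spec, n=feat.shape[0], axis=0)
```

Under these assumptions, a feature map that is constant in space yields zero spatial response and one that is constant in time yields zero motion response, which is the intuition behind suppressing static background while keeping object boundaries and moving regions.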
Abstract
Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. In our evaluation, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in terms of average success rate, and further exhibits strong generalization capabilities.
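The slow/fast split described above can be illustrated with a small scheduling sketch in which the fast head acts at every control step while the slow model refreshes its conditioning only occasionally. This is a single-threaded simulation of the update schedule, not the paper's system: `DualSystemLoop`, `slow_model`, `fast_head`, and `refresh_every` are hypothetical names, and the scalar stand-ins replace the actual VDM and DiT head.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DualSystemLoop:
    # Hypothetical stand-in for the VDM (System 2): observation -> conditioning.
    slow_model: Callable[[float], float]
    # Hypothetical stand-in for the DiT head (System 1): (observation, conditioning) -> action.
    fast_head: Callable[[float, float], float]
    # Assumed refresh period: System 2 runs once every `refresh_every` control steps.
    refresh_every: int = 4
    _cached: Optional[float] = None
    slow_calls: int = 0

    def step(self, t: int, obs: float) -> float:
        """Produce one action, refreshing the cached conditioning only on schedule."""
        if self._cached is None or t % self.refresh_every == 0:
            self._cached = self.slow_model(obs)
            self.slow_calls += 1
        # System 1 always runs, reusing the most recent (possibly stale) conditioning.
        return self.fast_head(obs, self._cached)


loop = DualSystemLoop(
    slow_model=lambda obs: obs * 10.0,
    fast_head=lambda obs, cond: obs + cond,
    refresh_every=4,
)
actions = [loop.step(t, float(t)) for t in range(8)]
```

In this toy run the slow model is invoked twice across eight steps while an action is produced at every step. A real deployment would run the slow model concurrently rather than inline, which this sketch deliberately leaves out.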
