Video2Act: A Dual-System Video Diffusion Policy with Robotic Spatio-Motional Modeling

Yueru Jia, Jiaming Liu, Shengbang Liu, Rui Zhou, Wanhe Yu, Yuyang Yan, Xiaowei Chi, Yandong Guo, Boxin Shi, Shanghang Zhang

TL;DR

Video2Act tackles robust robotic policy learning by explicitly extracting spatial and motion priors from video diffusion latent representations and feeding them to a fast diffusion-transformer action head. It introduces an asynchronous dual-system design: a slow, perceptual System 2 based on a video diffusion model provides structured conditioning, while a fast System 1 generates high-frequency actions via cross-attention conditioning. The approach uses Sobel-based spatial filtering and FFT-based motion features to produce discriminative spatio-motional cues, achieving state-of-the-art average success on RoboTwin simulation and six real-world tasks with notable generalization across unseen conditions. The work includes comprehensive ablations and qualitative analyses, demonstrating that the spatio-motional conditioning improves both stability and diversity of manipulation strategies in dynamic settings.

Abstract

Robust perception and dynamics modeling are fundamental to real-world robotic policy learning. Recent methods employ video diffusion models (VDMs) to enhance robotic policies, improving their understanding and modeling of the physical world. However, existing approaches overlook the coherent and physically consistent motion representations inherently encoded across frames in VDMs. To this end, we propose Video2Act, a framework that efficiently guides robotic action learning by explicitly integrating spatial and motion-aware representations. Building on the inherent representations of VDMs, we extract foreground boundaries and inter-frame motion variations while filtering out background noise and task-irrelevant biases. These refined representations are then used as additional conditioning inputs to a diffusion transformer (DiT) action head, enabling it to reason about what to manipulate and how to move. To mitigate inference inefficiency, we propose an asynchronous dual-system design, where the VDM functions as the slow System 2 and the DiT head as the fast System 1, working collaboratively to generate adaptive actions. By providing motion-aware conditions to System 1, Video2Act maintains stable manipulation even with low-frequency updates from the VDM. For evaluation, Video2Act surpasses previous state-of-the-art VLA methods by 7.7% in simulation and 21.7% in real-world tasks in terms of average success rate, further exhibiting strong generalization capabilities.
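The asynchronous schedule described in the abstract can be illustrated with a toy loop. This is a minimal sketch, not the authors' implementation: `SlowSystem`, `FastSystem`, `run_async`, and the `ratio` parameter are hypothetical names, and the arithmetic inside each class is a placeholder for the VDM and the DiT action head. It assumes `ratio` means the number of fast System 1 steps per slow System 2 update.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SlowSystem:
    """Hypothetical stand-in for the slow perceptual VDM (System 2)."""
    calls: int = 0

    def extract_condition(self, history: List[float]) -> float:
        # Placeholder feature: mean of the observation history so far.
        self.calls += 1
        return sum(history) / len(history)


@dataclass
class FastSystem:
    """Hypothetical stand-in for the fast DiT action head (System 1)."""
    calls: int = 0

    def act(self, observation: float, condition: float) -> float:
        # Placeholder policy: blend the fresh observation with the cached condition.
        self.calls += 1
        return 0.5 * observation + 0.5 * condition


def run_async(observations: List[float], ratio: int) -> Tuple[List[float], int, int]:
    """Run System 1 on every step and refresh System 2 once every `ratio` steps.

    Returns the actions plus the number of slow and fast calls.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    slow, fast = SlowSystem(), FastSystem()
    condition = 0.0  # overwritten at step 0, since 0 % ratio == 0
    history: List[float] = []
    actions: List[float] = []
    for step, obs in enumerate(observations):
        history.append(obs)
        if step % ratio == 0:
            # Slow path: recompute the conditioning signal.
            condition = slow.extract_condition(history)
        # Fast path: always act, reusing the most recent condition.
        actions.append(fast.act(obs, condition))
    return actions, slow.calls, fast.calls
```

With eight observations and `ratio=4`, the slow system runs at steps 0 and 4 only, while the fast system produces an action at every step using the cached condition in between.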

Paper Structure

This paper contains 46 sections, 11 equations, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: Overview of Video2Act. Unlike the static image–token concatenation used in traditional VLA models (a) or the direct VDM feature conditioning approach (b), our asynchronous dual-system model (c) employs a slow-system perceptual VDM to explicitly extract spatial and motion representations, while a fast-system action decoder performs high-frequency and stable robot control.
  • Figure 2: Qualitative analysis of latent representations. We visualize Grad-CAM activations for DINOv2, SigLIP and the Video Diffusion Model (VDM) during the block handover task, observed from two common robotic settings: a static third-person (Head Camera View) and a dynamic ego-centric (Wrist Camera View). The heatmaps for standard image encoders (DINOv2, SigLIP) are diffuse, unstable, and shift focus irregularly. In contrast, the VDM features consistently attend to the foreground objects being manipulated, demonstrating strong spatial structure awareness even under severe ego-motion.
  • Figure 3: Video2Act Framework. Video2Act employs an asynchronous dual-system framework consisting of a slow perceptual VDM (System 2) and a fast action head (System 1). System 2 extracts refined spatial and motion representations from two image inputs: high-resolution images for spatial filtering via Sobel operators and long-horizon sequences for motion extraction via FFT. These low-frequency spatio-motional features serve as conditioning inputs to System 1, which simultaneously receives high-frequency image tokens. Through cross-attention conditioning, these asynchronously updated signals are effectively fused, enabling robust and real-time action generation.
  • Figure 4: Real-world experiment results across six manipulation tasks on the Agilex Cobot Magic platform. Video2Act is compared against two closely related baselines, RDT and VPP; all methods are trained on 100 demonstrations per task. We report success rates over 10 rollouts per task under diverse tabletop configurations.
  • Figure 5: Ablation Study. We investigate (a) the effectiveness of spatio-motional feature extraction and (b) how the operating ratio influences success rate and action generation frequency.
  • ...and 14 more figures
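
The Figure 3 caption names two operators: Sobel filtering for spatial structure and an FFT over the frame sequence for motion. The sketch below shows both on toy data. It is an illustration under stated assumptions, not the paper's pipeline: the paper applies these ideas to VDM latent features, whose exact formulation is not given here, so the functions `sobel_magnitude`, `temporal_spectrum`, and `motion_energy` are hypothetical names operating on plain 2D grids and 1D time series. A direct O(n^2) discrete Fourier transform stands in for the FFT, which computes the same values faster.

```python
import cmath
import math
from typing import List

Grid = List[List[float]]

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img: Grid) -> Grid:
    """Gradient magnitude per pixel; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    v = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * v
                    gy += SOBEL_Y[dy][dx] * v
            out[y][x] = math.hypot(gx, gy)
    return out


def temporal_spectrum(signal: List[float]) -> List[float]:
    """Magnitudes of the discrete Fourier transform of one location over time."""
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def motion_energy(signal: List[float]) -> float:
    """Sum of non-DC magnitudes: near zero if the location never changes."""
    return sum(temporal_spectrum(signal)[1:])


# Toy image: dark left half, bright right half, so there is one vertical edge.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
edges = sobel_magnitude(image)

# Toy time series: a static location and one that flips every frame.
static_energy = motion_energy([1.0, 1.0, 1.0, 1.0])
moving_energy = motion_energy([0.0, 1.0, 0.0, 1.0])
```

In this toy case the Sobel response is nonzero only next to the edge and zero in the flat region, and the non-DC spectral energy separates the flickering location from the static one. This is the kind of foreground-boundary and inter-frame-variation cue the captions describe.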