DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Teli Ma; Jia Zheng; Zifan Wang; Chuili Jiang; Andy Cui; Junwei Liang; Shuo Yang

TL;DR

DiT4DiT is an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. It demonstrates that video generation can serve as an effective scaling proxy for robot policy learning.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. Their potential, however, has not been fully explored in the literature. To bridge this gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
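
As a rough illustration of the dual flow-matching objective mentioned in the abstract, the sketch below sums a video loss and an action loss, each evaluated at its own timestep. It is a minimal toy under stated assumptions, not the released implementation: the linear interpolation path with noise at timestep 0 and data at timestep 1, the function names, the `action_weight` parameter, and the flat-list stand-ins for latents are all hypothetical, and the conditioning of the action model on video hidden features is omitted for brevity.

```python
import random


def interpolate(noise, data, tau):
    """Point on the straight path from noise (tau=0) to data (tau=1). Assumed convention."""
    return [(1.0 - tau) * n + tau * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Constant velocity of the straight path: data minus noise."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, data, tau, rng):
    """Mean squared error between predicted and target velocity at timestep tau."""
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_tau = interpolate(noise, data, tau)
    target = target_velocity(noise, data)
    pred = predict_velocity(x_tau, tau)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(data)


def dual_flow_matching_loss(video_model, action_model, video_latent, actions,
                            tau_v, tau_a, rng, action_weight=1.0):
    """Sum of video and action losses with decoupled timesteps.

    action_weight is a hypothetical balancing factor, not taken from the paper.
    """
    video_loss = flow_matching_loss(video_model, video_latent, tau_v, rng)
    action_loss = flow_matching_loss(action_model, actions, tau_a, rng)
    return video_loss + action_weight * action_loss


# Toy usage: placeholder "models" that always predict zero velocity.
rng = random.Random(0)
zero_model = lambda x, tau: [0.0 for _ in x]
loss = dual_flow_matching_loss(zero_model, zero_model,
                               video_latent=[0.5, -0.2, 0.1],
                               actions=[0.3, 0.7],
                               tau_v=0.4, tau_a=0.8, rng=rng)
print(loss)
```

In a real training loop the two placeholder models would be the video and action Diffusion Transformers, and the timesteps would be drawn per sample rather than passed in as constants.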

Paper Structure (22 sections, 9 equations, 10 figures, 5 tables, 2 algorithms)

Introduction
Related Works
Vision-Language-Action Models
Video Generation in Robotics
Validation of Video Generation as a Scaling Proxy
DiT4DiT: Unleashing the Potential of Video Model
Preliminaries
Dual-DiT Architecture
Joint Training of Video and Action
Inference
Experiments
Experiment Setup
Comparison against State-of-the-art Policies
Generalization Capability
Ablations
...and 7 more sections

Figures (10)

Figure 1: Proxy objectives for scalable robot policy learning. Left: Comparison of three representative training paradigms: Grounding (object-level semantic alignment), FLARE-style [zheng2025flare] latent modeling (VLM-to-future-frame feature prediction), and Video generation (learning physically plausible future dynamics). Right: Video generation serves as the strongest scaling proxy, yielding higher sample efficiency (over $10\times$), faster convergence (up to $7\times$), and more favorable scaling trends across data regimes, with consistently better downstream manipulation success than semantic-centric baselines. All results are reported as the average success rate over 24 tasks in the RoboCasa-GR1 tabletop benchmark [nasiriany2024robocasa, bjorck2025gr00t].
Figure 2: Overview of the proposed DiT4DiT framework. Top: Given the current observation and language goal, the video DiT predicts future dynamics and exposes intermediate generative features at a specific flow timestep; these features condition the action DiT to infer control trajectories. The two models are jointly optimized with a dual flow-matching objective for video generation and action prediction. Bottom: Visual plans generated by the video DiT (more examples are shown in Fig. \ref{fig:video_gen}).
Figure 3: Asymmetric tri-timestep design. We decouple the diffusion timesteps to optimize joint video-action generation. The video module uses uniform sampling ($\tau_v$) to capture the full denoising trajectory, while the action module uses Beta sampling ($\tau_a$) to focus on critical control phases. Meanwhile, stable visual conditions are extracted at a fixed deterministic timestep ($\tau_f$) from the evolving hidden states ($h_t^1 \rightarrow h_t^0$).
Figure 4: Real-world evaluation suite on the Unitree G1 humanoid robot. The selected tasks evaluate distinct dimensions of robotic proficiency, ranging from high-precision spatial manipulation (e.g., stack up the cups, insert plate into the rack, arrange the flower) to complex, extended-horizon execution (e.g., box packing, drawer interaction).
Figure 5: Real-world evaluation results on the Unitree G1 robot. Success rates are reported across seven diverse household tasks. DiT4DiT comprehensively outperforms both the pre-trained GR00T-N1.5 [bjorck2025gr00t] and the parameter-matched Qwen3DiT baseline, highlighting the efficiency and efficacy of our framework.
...and 5 more figures
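
The asymmetric tri-timestep design described in the Figure 3 caption can be sketched as follows: the video timestep $\tau_v$ is drawn uniformly, the action timestep $\tau_a$ is drawn from a Beta distribution, and the feature-extraction timestep $\tau_f$ is held fixed. This is a minimal sketch only. The caption gives neither the Beta parameters nor the value of $\tau_f$, so the constants and the function name below are hypothetical placeholders.

```python
import random

# Hypothetical placeholders: the caption does not specify these values.
ACTION_BETA_ALPHA = 1.5
ACTION_BETA_BETA = 1.0
FEATURE_TIMESTEP = 0.5


def sample_timesteps(rng):
    """Draw one set of decoupled timesteps, each in [0, 1]."""
    tau_v = rng.uniform(0.0, 1.0)  # video: uniform over the full trajectory
    tau_a = rng.betavariate(ACTION_BETA_ALPHA, ACTION_BETA_BETA)  # action: Beta-distributed
    tau_f = FEATURE_TIMESTEP  # feature extraction: fixed, never sampled
    return {"tau_v": tau_v, "tau_a": tau_a, "tau_f": tau_f}


rng = random.Random(0)
batch = [sample_timesteps(rng) for _ in range(4)]
for steps in batch:
    print(steps)
```

Because $\tau_f$ is constant, the hidden states handed to the action module come from the same point of the video denoising trajectory for every sample, which matches the caption's description of stable visual conditions.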