Real-Time Motion-Controllable Autoregressive Video Diffusion
Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang
TL;DR
AR-Drag addresses the latency and control challenges of real-time video generation with a two-stage approach: it first distills a real-time motion-controllable base video diffusion model (VDM) from a bidirectional teacher, using Self-Rollout to preserve the autoregressive Markov property, and then optimizes the AR VDM with GRPO in an MDP framework, using selective stochasticity and a trajectory-based reward for realism and motion accuracy. The forward diffusion uses an ODE flow $\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_t(\boldsymbol{x}_t, t)$ with an ODE-to-SDE conversion to inject stochasticity during training, aligning training with inference. Empirical results show that AR-Drag achieves a first-frame latency of approximately 0.44s while delivering lower FID/FVD and higher aesthetic quality, motion smoothness, and motion consistency than strong baselines, all with only 1.3B parameters. This work enables practical, real-time controllable I2V generation and demonstrates the effectiveness of Self-Rollout and trajectory-based RL rewards in reducing train–test gaps for autoregressive diffusion models.
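To make the two ingredients named above more concrete, the following is a minimal sketch, not the authors' implementation. It shows (a) few-step Euler integration of an ODE of the form dx/dt = v(x, t), and (b) the group-relative advantage normalization commonly associated with GRPO. The function names, the time convention (t = 1 is noise, t = 0 is data), the toy constant velocity field, and the use of the population standard deviation are all assumptions made for illustration; the ODE-to-SDE conversion and the trajectory-based reward are not modeled here.

```python
import statistics
from typing import Callable, List, Sequence

Vector = List[float]


def euler_ode_sample(
    x_start: Sequence[float],
    velocity: Callable[[Vector, float], Vector],
    num_steps: int,
    t_start: float = 1.0,  # assumed convention: t = 1 is pure noise
    t_end: float = 0.0,    # assumed convention: t = 0 is clean data
) -> Vector:
    """Integrate dx/dt = v(x, t) from t_start to t_end with Euler steps."""
    x = list(x_start)
    dt = (t_end - t_start) / num_steps  # negative when moving from noise to data
    t = t_start
    for _ in range(num_steps):
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# Toy example: a constant velocity field (noise - target) moves the sample
# from the noise point to the target point in a few steps.
noise = [1.0, -2.0]
target = [0.5, 0.25]


def toy_velocity(x: Vector, t: float) -> Vector:
    # Hypothetical stand-in for a learned velocity network v_t(x_t, t).
    return [n - g for n, g in zip(noise, target)]


sample = euler_ode_sample(noise, toy_velocity, num_steps=4)
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

In this toy setting the velocity is constant, so four Euler steps land on the target; a learned velocity field would generally leave some discretization error at such low step counts. The advantages are centered on zero within the group, so rollouts are ranked relative to their siblings rather than by absolute reward.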
Abstract
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
