Real-Time Motion-Controllable Autoregressive Video Diffusion

Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang

TL;DR

AR-Drag addresses the latency and control challenges of real-time video generation with a two-stage approach: it first distills a real-time motion-controllable base VDM from a bidirectional teacher, using Self-Rollout to preserve the autoregressive Markov property, and then optimizes the AR VDM with GRPO in an MDP framework, using selective stochasticity and a trajectory-based reward for realism and motion accuracy. The forward diffusion uses an ODE flow $\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_t(\boldsymbol{x}_t, t)$ with an ODE-to-SDE conversion to inject stochasticity during training, aligning training with inference. Empirical results show AR-Drag achieves a first-frame latency of approximately 0.44s while delivering lower FID/FVD and higher aesthetic quality, motion smoothness, and motion consistency than strong baselines, all with only 1.3B parameters. This work enables practical, real-time controllable I2V generation and demonstrates the effectiveness of Self-Rollout and trajectory-based RL rewards in reducing train–test gaps for autoregressive diffusion models.
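To make the ODE flow in the TL;DR concrete, the following is a minimal sketch of few-step Euler integration of $\frac{d\boldsymbol{x}_t}{dt} = \boldsymbol{v}_t(\boldsymbol{x}_t, t)$. It is an illustration under stated assumptions, not the authors' sampler: `euler_flow` and `toy_velocity` are hypothetical names, the velocity field is a toy stand-in for the learned network, and the time direction and step count are chosen only for the example.

```python
import numpy as np

def euler_flow(x0, velocity, num_steps=4, t_start=0.0, t_end=1.0):
    """Integrate dx/dt = velocity(x, t) with fixed-step Euler updates."""
    x = np.asarray(x0, dtype=float)
    ts = np.linspace(t_start, t_end, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One Euler step: x <- x + dt * v(x, t)
        x = x + (t_next - t_cur) * velocity(x, t_cur)
    return x

def toy_velocity(x, t):
    # Hypothetical stand-in for the learned velocity network (assumption).
    return -x

# Four steps mirror the "few-step" setting; the count itself is an assumption.
x_final = euler_flow(np.ones(3), toy_velocity, num_steps=4)
```

With this toy field, each step scales the state by `1 - dt`, so fewer steps mean a coarser approximation of the exact flow. That is one way to see why few-step generation can degrade quality.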

Abstract

Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
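The reinforcement-learning stage uses GRPO, which in its common formulation scores each sample relative to the other samples generated for the same condition. The sketch below shows only that group-relative normalization. The function name `group_relative_advantages`, the `eps` constant, and the reward values are hypothetical. The trajectory-based reward model itself is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four rollouts sharing one image and trajectory.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Rollouts with above-average reward receive positive advantages and are reinforced, while below-average rollouts are discouraged, without requiring a separate value network.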

Paper Structure

This paper contains 17 sections, 13 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Comparison for motion-controllable video generation. (a) Bidirectional VDMs denoise all frames jointly; motion control can be adjusted only after all frames are generated, causing high latency. (b) In contrast, AR VDMs generate frames sequentially; motion control can be updated frame by frame and, if unsatisfactory, regenerated on the fly, enabling real-time adjustment. (c) Our method achieves significantly lower latency while maintaining superior FID performance.
  • Figure 2: Comparison between typical AR VDMs and Self-Rollout. Self-Rollout faithfully follows the inference process during training, minimizing the train–test gap and naturally preserving the Markov property.
  • Figure 3: Qualitative comparisons with Tora and Self-Forcing across different prompts, data domains, and resolutions, demonstrating the superior fidelity and controllability of our method.
  • Figure 4: Ablation on key training strategies. Prompt: movement following the trajectory.
  • Figure 5: Visualization on diverse motion. Prompt: movement following the trajectory.
  • ...and 1 more figure