End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Yuwei Guo; Ceyuan Yang; Hao He; Yang Zhao; Meng Wei; Zhenheng Yang; Weilin Huang; Dahua Lin

TL;DR

Exposure bias in autoregressive video diffusion causes errors to accumulate over long rollouts. The paper introduces Resampling Forcing, a teacher-free, end-to-end training framework that self-resamples history frames to simulate inference-time errors and uses a sparse causal mask to train all frames in parallel. A parameter-free history routing mechanism keeps attention complexity near-constant for long videos while preserving global dependencies. Empirically, the approach achieves quality comparable to distillation-based baselines and superior temporal consistency on longer videos, without requiring a bidirectional teacher.

Abstract

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
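As a rough illustration of the top-k retrieval idea behind history routing, the sketch below selects the k history frames most relevant to a query. It is not the authors' implementation: the function name `route_history`, the use of one descriptor vector per frame, and dot-product scoring are all assumptions made for this example, since the abstract does not specify how relevance is computed.

```python
from typing import List, Sequence


def route_history(query: Sequence[float],
                  frame_keys: List[Sequence[float]],
                  k: int) -> List[int]:
    """Return indices of the k history frames scoring highest against the query.

    Assumptions (not from the paper): each history frame is summarized by a
    single descriptor vector, and relevance is a plain dot product.
    """
    # Score every history frame against the query.
    scores = [sum(q * x for q, x in zip(query, key)) for key in frame_keys]
    # Rank frame indices by descending score; the stable sort keeps the
    # earlier frame first when two scores tie.
    ranked = sorted(range(len(frame_keys)), key=lambda i: -scores[i])
    # Return the selected frames in chronological order.
    return sorted(ranked[:k])


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    print(route_history([1.0, 0.0], keys, k=2))
```

Because only k frames are retained per query, the set attended to stays a fixed size as the history grows, and no learned parameters are involved in the selection. That is in keeping with the parameter-free, near-constant-complexity description above, although the real mechanism may differ in how it scores and groups frames.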
