End-to-End Training for Autoregressive Video Diffusion via Self-Resampling

Yuwei Guo, Ceyuan Yang, Hao He, Yang Zhao, Meng Wei, Zhenheng Yang, Weilin Huang, Dahua Lin

TL;DR

Exposure bias in autoregressive video diffusion causes error accumulation during long rollouts. The paper introduces Resampling Forcing, a teacher-free end-to-end training framework that uses self-resampling of history frames and a sparse causal mask to enable parallel frame training. A history routing mechanism maintains near-constant attention complexity for long videos while preserving global dependencies. Empirically, the approach achieves comparable quality to distillation-based baselines and superior temporal consistency for longer videos, offering practical efficiency without bidirectional teachers.

Abstract

Autoregressive video diffusion models hold promise for world simulation but are vulnerable to exposure bias arising from the train-test mismatch. While recent works address this via post-training, they typically rely on a bidirectional teacher model or online discriminator. To achieve an end-to-end solution, we introduce Resampling Forcing, a teacher-free framework that enables training autoregressive video models from scratch and at scale. Central to our approach is a self-resampling scheme that simulates inference-time model errors on history frames during training. Conditioned on these degraded histories, a sparse causal mask enforces temporal causality while enabling parallel training with frame-level diffusion loss. To facilitate efficient long-horizon generation, we further introduce history routing, a parameter-free mechanism that dynamically retrieves the top-k most relevant history frames for each query. Experiments demonstrate that our approach achieves performance comparable to distillation-based baselines while exhibiting superior temporal consistency on longer videos owing to native-length training.
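
The sparse causal mask mentioned in the abstract can be illustrated at frame granularity. This is a minimal sketch rather than the authors' implementation. It assumes a training sequence laid out as N history frames followed by N noisy target frames, with one token per frame. The function name `sparse_causal_mask` and this layout are hypothetical choices made for illustration.

```python
def sparse_causal_mask(num_frames):
    """Build a frame-level boolean attention mask.

    mask[q][k] is True when query position q may attend to key position k.
    Assumed (hypothetical) layout: positions 0..N-1 hold the degraded history
    frames, positions N..2N-1 hold the noisy frames being denoised.
    """
    n = num_frames
    mask = [[False] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        # History frame i attends causally to history frames 0..i.
        for j in range(i + 1):
            mask[i][j] = True
        # Noisy frame i attends to strictly earlier history frames 0..i-1 ...
        for j in range(i):
            mask[n + i][j] = True
        # ... and to itself, but to no other noisy frame.
        mask[n + i][n + i] = True
    return mask


if __name__ == "__main__":
    for row in sparse_causal_mask(3):
        print("".join("x" if allowed else "." for allowed in row))
```

Because every noisy frame sees only earlier history frames and itself, all frames can be supervised in one forward pass while each frame's prediction remains causal.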

Paper Structure

This paper contains 10 sections, 9 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: We introduce Resampling Forcing, an end-to-end, teacher-free training framework for autoregressive video diffusion models. Top: Teacher forcing accumulates errors and leads to video collapse. Middle: Distilled from a short bidirectional teacher, Self Forcing suffers from degraded quality on longer videos. Bottom: Our method offers stable quality through native training on long videos.
  • Figure 2: Error Accumulation. Top: Models trained on ground-truth inputs introduce errors and compound them autoregressively. Bottom: We train the model on imperfect inputs with simulated model errors, which stabilizes long-horizon autoregressive generation. The gray circle represents the closest match in the ground-truth distribution.
  • Figure 3: Resampling Forcing. (a) To simulate inference-time model error, we add noise to clean videos up to a sampled timestep $t_s$, then use the online model weights to autoregressively complete the remaining denoising steps. (b) The model is trained in parallel with a frame-level diffusion loss. (c) A sparse causal mask restricts each frame to attend only to its clean history frames.
  • Figure 4: History Routing Mechanism. Our routing mechanism dynamically selects the top-$k$ most relevant history frames to attend to. This illustration shows a $k=2$ example, where only the 1st and 3rd frames are selected for the 4th frame's query token $\boldsymbol{q}_4$.
  • Figure 5: Qualitative Comparisons. Top: We compare with representative autoregressive video generation models, showing our method's stable quality on long video generation. Bottom: Compared with LongLive (yang2025longlive), which is distilled from a short bidirectional teacher, our method exhibits better causality. Dashed lines mark the highest liquid level, and red arrows highlight the liquid level in each frame.
  • ...and 3 more figures
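
The top-$k$ selection described in the Figure 4 caption can be sketched as follows. The captions do not say how relevance is scored, so this example assumes a parameter-free score: the dot product between the query and the mean of each history frame's key vectors. That scoring rule, the function name `route_history`, and the toy vectors are all assumptions made for illustration, not the paper's definition.

```python
def route_history(query, frame_keys, k):
    """Return the indices of the k history frames most relevant to `query`.

    query:      list of floats, one query token.
    frame_keys: list of frames; each frame is a list of key vectors.
    Relevance (assumed here) is the dot product of the query with the
    mean key of each frame, so no learned parameters are involved.
    """
    dim = len(query)
    scored = []
    for idx, keys in enumerate(frame_keys):
        # Mean-pool the frame's keys into a single descriptor vector.
        mean_key = [sum(key[d] for key in keys) / len(keys) for d in range(dim)]
        score = sum(q * m for q, m in zip(query, mean_key))
        scored.append((score, idx))
    # Keep the k highest-scoring frames, reported in temporal order.
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sorted(idx for _, idx in top)


if __name__ == "__main__":
    # Query for the 4th frame with three history frames, k = 2.
    q4 = [1.0, 0.0]
    history = [
        [[0.9, 0.1], [0.7, 0.3]],    # frame 1: well aligned with q4
        [[-0.5, 1.0], [-0.3, 0.8]],  # frame 2: poorly aligned
        [[0.6, 0.2], [0.8, 0.0]],    # frame 3: well aligned
    ]
    print(route_history(q4, history, k=2))  # → [0, 2]
```

With this toy data the 1st and 3rd frames are selected, matching the $k=2$ case in the caption. Since each query attends to at most k frames, the attention cost per query stays roughly constant as the history grows, apart from the cost of scoring the frames.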