Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models

Tianle Cheng, Zeyan Zhang, Kaifeng Gao, Jun Xiao

TL;DR

This work tackles long video generation with autoregressive diffusion models by introducing Adaptive Begin-of-Video Tokens (ada-BOV), which provide dynamic, per-attention-block global guidance modulated by recent frames. It couples ada-BOV with a refinement strategy for stream denoising that extends the effective sampling trajectory without increasing attention-window size, and a disturbance-augmented training schedule to improve robustness. Empirically, the approach achieves state-of-the-art results on Minecraft and Sky Timelapse across quality and temporal-coherence metrics while maintaining efficiency, and ablations validate the benefits of per-block guidance, the refinement strategy, and the training schedules. The method offers a practical path to high-quality, coherent long videos from diffusion models and provides insights into training dynamics for streaming video generation.

Abstract

Recent advances in diffusion-based video generation have produced impressive, high-fidelity short videos. To extend this success to coherent long videos, most video diffusion models (VDMs) generate videos autoregressively, i.e., they generate subsequent frames conditioned on previous ones. There are two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning and suffers from denoising latency and error accumulation. The latter maintains a denoising sequence with monotonically increasing noise levels: in each denoising iteration, one clean frame is produced while a new pure-noise frame is appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings in VDMs that adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves global consistency while allowing flexible conditioning in dynamic scenarios. To ensure the quality of the local dynamics that modulate the BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size, leading to improved local guidance and overall image quality. We also propose a disturbance-augmented training noise schedule, which balances convergence speed against model robustness for stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.
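The stream-denoising loop described in the abstract can be illustrated with a toy simulation. This is a minimal sketch, not the paper's implementation: it tracks only integer noise levels rather than frames, and the function name `stream_denoise`, the parameter `window_size`, and the convention that level 0 means clean and level `window_size` means pure noise are all assumptions made for illustration.

```python
from collections import deque


def stream_denoise(num_frames, window_size):
    """Toy stream denoising that tracks noise levels only (no pixels, no model).

    Assumed convention (not from the paper): level 0 is a clean frame and
    level `window_size` is pure noise.
    """
    # The window holds [frame_id, noise_level] pairs with monotonically
    # increasing noise levels: 1, 2, ..., window_size.
    window = deque([i, i + 1] for i in range(window_size))
    next_id = window_size
    emitted = []

    while len(emitted) < num_frames:
        # One denoising iteration: every frame in the window moves one
        # level closer to clean. A real model would run the denoiser here.
        for entry in window:
            entry[1] -= 1

        # The head of the window is now clean, so it leaves the queue.
        frame_id, level = window.popleft()
        assert level == 0
        emitted.append(frame_id)

        # A new pure-noise frame is appended at the tail.
        window.append([next_id, window_size])
        next_id += 1

    return emitted, [level for _, level in window]


emitted, levels = stream_denoise(num_frames=5, window_size=4)
print(emitted)  # frame ids in the order they became clean
print(levels)   # noise levels left in the window
```

In this toy version every frame receives exactly `window_size` denoising updates before it is emitted, which is the coupling between sampling trajectory length and attention window size that the abstract's refinement strategy is said to remove.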

Paper Structure

This paper contains 20 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Pipeline for any frame from pure noise to clean. (a) Vanilla stream denoising VDMs. The dashed arrows indicate that a fixed global reference frame can optionally be injected [henschel2025streamingt2v, tian2024videotetris]. (b) Our approach, where the solid arrows indicate the injection of reference frames. The BOV tokens contain a variety of global features and are modulated with the latest generated frame for flexible guidance. The refinement strategy for stream denoising supports denser denoising steps.
  • Figure 2: Temporal attention block ($j$-th) with our ada-BOV token design. The BOV token contains global features and is modulated with the latest generated frame.
  • Figure 3: Illustration of the sampling trajectory for frame $i$ with our refined inference strategy. $L$ represents the attention window size, and each autoregressive iteration consists of $n$ substeps.
  • Figure 4: Qualitative examples on Minecraft [guss2019minerl], generated by VDT [lu2024vdt], OpenSora [opensora], FIFO [kim2024fifo], and ours. We use purple boxes to represent the low-quality frames in OpenSora and red boxes to represent the flickers between adjacent frames in FIFO.
  • Figure 5: Qualitative examples on Sky Timelapse [zhang2020dtvnet], generated by VDT [lu2024vdt], Ca2-VDM [gao2025ca2], SEINE [chen2023seine], OpenSora [opensora], FIFO [kim2024fifo], DiTCtrl [cai2025ditctrl], and ours. The baseline methods exhibit significant flaws, including noticeable artifacts in OpenSora, a gradual collapse into static frames in VDT, and severe quality degradation or unnatural color shift in the remaining models.
  • ...and 5 more figures
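
The Figure 2 caption says the BOV token is modulated with the latest generated frame, and the abstract calls this an adaptive-layer-norm-like modulation. A common form of adaptive layer norm normalizes a vector and then applies a scale and shift predicted from a conditioning signal. The sketch below shows that generic form only; whether the paper uses exactly this formula is an assumption, and every name here (`modulate_bov`, `frame_feat`, the weight and bias arguments) is hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer: one row of `weight` per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def modulate_bov(bov, frame_feat, w_scale, b_scale, w_shift, b_shift):
    """Generic adaLN-style modulation (assumed form, not taken from the paper).

    bov:        learnable token vector for one attention block
    frame_feat: feature vector summarizing the latest generated frame
    """
    scale = linear(frame_feat, w_scale, b_scale)
    shift = linear(frame_feat, w_shift, b_shift)
    normed = layer_norm(bov)
    # The token keeps its learned (global) content, while the scale and
    # shift let the latest frame adjust it.
    return [(1.0 + s) * n + t for s, n, t in zip(scale, normed, shift)]


# Hypothetical sizes: a 3-dimensional token and a 2-dimensional frame feature.
bov = [1.0, 2.0, 3.0]
frame_feat = [0.5, -0.5]
zero_w = [[0.0, 0.0] for _ in range(3)]
zero_b = [0.0, 0.0, 0.0]
print(modulate_bov(bov, frame_feat, zero_w, zero_b, zero_w, zero_b))
```

With zero projection weights and biases the modulation reduces to plain layer normalization of the token, so the frame-dependent scale and shift act as a learned correction on top of the global token.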