Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models
Tianle Cheng, Zeyan Zhang, Kaifeng Gao, Jun Xiao
TL;DR
This work tackles long video generation with autoregressive diffusion models by introducing Adaptive Begin-of-Video Tokens (ada-BOV), which provide dynamic, per-attention-block global guidance modulated by recent frames. It couples ada-BOV with two further components: a refinement strategy for stream denoising that extends the effective sampling trajectory without increasing the attention-window size, and a disturbance-augmented training noise schedule that improves robustness. Empirically, the approach achieves state-of-the-art results on Minecraft and Sky Timelapse across quality and temporal-coherence metrics while maintaining efficiency, and ablations validate the benefits of per-block guidance, the refinement strategy, and the training schedules. The method offers a practical path to high-quality, coherent long videos from diffusion models and provides insights into training dynamics for streaming video generation.
Abstract
Recent advances in diffusion-based video generation have produced impressive, high-fidelity short videos. To extend these successes to coherent long videos, most video diffusion models (VDMs) generate videos autoregressively, i.e., they generate subsequent frames conditioned on previous ones. There are two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning and suffers from denoising latency and error accumulation. The latter maintains a denoising sequence with monotonically increasing noise levels: in each denoising iteration, one clean frame is produced while a new pure-noise frame is appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens (ada-BOV) for autoregressive VDMs. The BOV tokens are special learnable embeddings in VDMs that adaptively absorb denoised preceding frames via an adaptive-layer-norm-like modulation. This design preserves global consistency while allowing flexible conditioning in dynamic scenarios. To ensure the quality of the local dynamics that are essential for modulating the BOV tokens, we further propose a refinement strategy for stream denoising. It decouples the sampling trajectory length from the attention window size, leading to improved local guidance and overall imaging quality. We also propose a disturbance-augmented training noise schedule, which balances convergence speed with model robustness for stream denoising. Extensive experiments demonstrate that our method achieves compelling qualitative and quantitative results across multiple metrics.
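
The stream-denoising paradigm described in the abstract can be illustrated with a toy simulation that tracks only noise levels, not pixels. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `stream_denoise`, the integer noise levels, the one-level-per-iteration update, and the pre-filled initial window are all hypothetical simplifications.

```python
from collections import deque


def stream_denoise(num_frames, window):
    """Simulate the noise-level bookkeeping of a stream-denoising queue.

    Assumptions (illustrative only): noise levels are integers, level 0 is
    clean, level `window` is pure noise, and each iteration lowers every
    frame in the window by exactly one level.
    """
    # Assumed starting state: a full window with levels 1..window,
    # i.e. monotonically increasing from head to tail.
    queue = deque((frame_id, frame_id + 1) for frame_id in range(window))
    next_id = window
    emitted = []

    while len(emitted) < num_frames:
        # One denoising iteration: every frame drops one noise level.
        queue = deque((fid, level - 1) for fid, level in queue)

        # The head frame has reached level 0 (clean) and is emitted.
        fid, level = queue.popleft()
        assert level == 0
        emitted.append(fid)

        # A new pure-noise frame is appended at the tail.
        queue.append((next_id, window))
        next_id += 1

    return emitted, list(queue)


emitted, queue = stream_denoise(num_frames=5, window=4)
print(emitted)  # frame ids in the order they became clean
print(queue)    # (frame id, noise level) pairs still in the window
```

In this toy setting every frame receives exactly `window` denoising steps before it is emitted, so the sampling trajectory length is tied to the window size. That coupling is what the refinement strategy mentioned in the abstract is said to remove; the sketch does not attempt to model the refinement itself.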
