UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu
TL;DR
The paper identifies attention dispersion as the unified cause of video length extrapolation failures in diffusion-transformer models, showing that RoPE-harmonic frequencies can create periodic attention patterns that lead to content repetition, while non-harmonic dispersion degrades visual quality. It introduces UltraViCo, a training-free method that suppresses out-of-window attention with a constant decay factor and targeted damping near harmonic alignment, paired with a memory-efficient CUDA kernel for scalable implementation. Experiments across multiple T2V models demonstrate that UltraViCo extends practical extrapolation from $2\times$ to $4\times$, delivering substantial gains in Dynamic Degree and Imaging Quality (e.g., 233% and 40.5% at $4\times$) and generalizing to downstream tasks like controllable video synthesis and editing. The work provides a practical, broadly compatible approach to long-video generation, coupling attention-centric insights with efficient implementation to enable robust, scalable extrapolation in real-world applications.
Abstract
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion leads to quality degradation, and repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns induced by the harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we largely outperform a broad set of baselines across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, UltraViCo improves Dynamic Degree and Imaging Quality by 233% and 40.5%, respectively, over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
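To make the idea of suppressing out-of-window attention with a constant decay factor concrete, the following is a minimal NumPy sketch, not the authors' implementation. Several details are assumptions made purely for illustration: the function name `decayed_attention`, the definition of "out of window" as a relative frame distance of at least `window`, and applying the factor `alpha` to the softmax weights followed by renormalization. The abstract does not specify these details, and the sketch omits the targeted damping near harmonic alignment and the CUDA kernel.

```python
import numpy as np


def decayed_attention(q, k, v, frame_idx, window, alpha):
    """Single-head attention with a constant decay on out-of-window keys.

    Hypothetical helper for illustration only.
    q, k, v   : float arrays of shape (n_tokens, d)
    frame_idx : int array of shape (n_tokens,), frame index of each token
    window    : assumed training length in frames (must be >= 1)
    alpha     : constant decay factor in [0, 1]
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Subtract the row-wise maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)

    # Assumption: a key is out of window when its frame is at least
    # `window` frames away from the query's frame.
    dist = np.abs(frame_idx[:, None] - frame_idx[None, :])
    scale = np.where(dist < window, 1.0, alpha)

    # Down-weight out-of-window keys, then renormalize each row.
    weights = weights * scale
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy usage: 8 frames with one token per frame and an assumed window of 4.
rng = np.random.default_rng(0)
n_frames, d, window = 8, 16, 4
q, k, v = (rng.standard_normal((n_frames, d)) for _ in range(3))
frames = np.arange(n_frames)
out, w = decayed_attention(q, k, v, frames, window, alpha=0.1)
```

Under these assumptions, `alpha = 1` recovers standard attention and `alpha = 0` reduces to a hard mask on out-of-window keys. For `alpha > 0`, scaling the softmax weights and renormalizing is equivalent to adding `log(alpha)` to the out-of-window logits before the softmax, which is how such a decay could be expressed as an additive attention bias.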
