UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, Jun Zhu

TL;DR

The paper identifies attention dispersion as the unified cause of video length extrapolation failures in diffusion-transformer models, showing that harmonic RoPE frequencies can create periodic attention patterns that lead to content repetition, while non-harmonic dispersion degrades visual quality. It introduces UltraViCo, a training-free method that suppresses out-of-window attention with a constant decay factor and targeted damping near harmonic alignment, paired with a memory-efficient CUDA kernel for scalable implementation. Experiments across multiple T2V models demonstrate that UltraViCo extends practical extrapolation from $2\times$ to $4\times$, delivering substantial gains in Dynamic Degree and Imaging Quality (e.g., 233% and 40.5% at $4\times$) and generalizing to downstream tasks like controllable video synthesis and editing. The work provides a practical, broadly compatible approach to long-video generation, coupling attention-centric insights with efficient implementation to enable robust, scalable extrapolation in real-world applications.

Abstract

Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This leads to quality degradation, and repetition emerges as a special case when this dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
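
To make the idea of suppressing out-of-window attention with a constant decay factor concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes (hypothetically) that the training window is the set of keys within `train_len // 2` frames of the query, and that the decay factor `alpha` multiplies the post-softmax weights of all other keys before renormalization. The paper's exact formulation, including where the factor is applied, may differ, and the helper names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def decayed_attention_row(logits, query_pos, train_len, alpha):
    """Attention weights of one query over frame positions 0..len(logits)-1.

    Assumption (not from the paper): keys farther than train_len // 2 frames
    from the query count as out-of-window; their post-softmax weights are
    multiplied by the constant alpha and the row is renormalized.
    """
    weights = softmax(logits)
    half = train_len // 2
    scaled = [
        w if abs(k - query_pos) <= half else alpha * w
        for k, w in enumerate(weights)
    ]
    total = sum(scaled)
    return [w / total for w in scaled]


def in_window_mass(weights, query_pos, train_len):
    """Fraction of attention mass that falls inside the assumed window."""
    half = train_len // 2
    return sum(w for k, w in enumerate(weights) if abs(k - query_pos) <= half)


# Fully dispersed attention: 16 frames, uniform logits, window of 5 frames.
logits = [0.0] * 16
baseline = decayed_attention_row(logits, query_pos=8, train_len=4, alpha=1.0)
damped = decayed_attention_row(logits, query_pos=8, train_len=4, alpha=0.1)
print(in_window_mass(baseline, 8, 4), in_window_mass(damped, 8, 4))
```

In this toy setting the in-window attention mass rises from 5/16 (about 0.31) to 5/6.1 (about 0.82), which illustrates how a single constant can refocus a dispersed row toward the training window without any retraining.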

Paper Structure

This paper contains 42 sections, 1 theorem, 24 equations, 18 figures, 8 tables, 1 algorithm.

Key Result

Proposition 1

For a function $f(\Delta t) = \sum_{i=0}^{N-1} a_i \cos(\phi_i \Delta t)$, where $a_i > 0$, $\phi_i > 0$, and $\min_{i}\phi_i=\phi_{N-1}$, if and only if $\forall i,\ \phi_i/\phi_{N-1}\in\mathbb{N}^+$ (i.e., the frequencies form a set of harmonics), $f(\Delta t)$ is periodic with period $T_{N-1}=\frac{2\pi}{\phi_{N-1}}$ …
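
The periodicity claim in Proposition 1 can be checked numerically. The sketch below evaluates $f$ at shifted arguments and measures the largest deviation $|f(\Delta t + T) - f(\Delta t)|$ with $T = 2\pi/\min_i \phi_i$. The amplitudes and frequencies are made-up illustrative values, not taken from any model, and a numerical check of this kind is a sanity test rather than a proof.

```python
import math


def f(dt, amps, freqs):
    """f(dt) = sum_i a_i * cos(phi_i * dt)."""
    return sum(a * math.cos(p * dt) for a, p in zip(amps, freqs))


def max_period_error(amps, freqs, samples=200):
    """Largest |f(t + T) - f(t)| over sample points, with T = 2*pi / min(freqs)."""
    period = 2 * math.pi / min(freqs)
    points = (0.37 * k for k in range(samples))
    return max(
        abs(f(t + period, amps, freqs) - f(t, amps, freqs)) for t in points
    )


amps = [1.0, 0.5, 0.25]                      # illustrative a_i > 0
harmonic = [4.0, 2.0, 1.0]                   # every ratio to the minimum is an integer
non_harmonic = [4.0, math.sqrt(2.0), 1.0]    # one ratio is irrational

print(max_period_error(amps, harmonic))      # tiny: floating-point error only
print(max_period_error(amps, non_harmonic))  # clearly non-zero
```

With harmonic frequencies the deviation stays at floating-point noise, so the function repeats every $2\pi/\phi_{N-1}$. Replacing one frequency with a non-integer multiple of the smallest one breaks that period.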

Figures (18)

  • Figure 1: Visual results. UltraViCo achieves significant extrapolation improvement on (a) T2V models and (b) downstream tasks. See prompts and videos in supplementary materials.
  • Figure 2: Failure modes of video length extrapolation. Some models exhibit periodic content repetition, while quality degradation occurs universally. Both failure modes intensify with longer extrapolations. “extra.” denotes extrapolation. See the appendix on failure modes of CogVideoX for additional models.
  • Figure 3: Periodic attention patterns as cause of content repetition. Left: unlike Wan, HunyuanVideo exhibits row-wise periodic attention during $4\times$ extrapolation, causing repeated outputs. Right: statistical row-wise attention can be expressed as a linear combination of trigonometric functions of RoPE frequencies, whose properties govern this periodicity. Hun. denotes HunyuanVideo.
  • Figure 4: Fixing repetition reveals attention dispersion as the fundamental cause. Left: our intervention, initially targeting repetition, surprisingly enhances video quality in both models. Right: the shared mechanism is revealed, where the intervention refocuses diffuse baseline attention toward the central training window. This suggests attention dispersion as the unified cause.
  • Figure 5: Validation of attention dispersion as the cause of quality degradation. Both (a) quantitative and (b) qualitative results show that video quality improves monotonically as the degree of attention central focusing (i.e., the masking ratio of out-of-window scores) increases.
  • ...and 13 more figures
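
The caption of Figure 3 states that row-wise attention can be expressed as a linear combination of trigonometric functions of RoPE frequencies. The sketch below shows why this form arises for a single query-key pair under standard 1D RoPE: the score depends only on the relative offset, and it equals a sum of cosine and sine terms at the RoPE frequencies. The vectors, the dimension, and the base of 10000 are illustrative assumptions. Video models typically apply RoPE per axis, and the paper's statistical analysis over many tokens is not reproduced here.

```python
import math


def rope_rotate(vec, pos, freqs):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * freqs[i]."""
    out = []
    for i, phi in enumerate(freqs):
        x, y = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * phi), math.sin(pos * phi)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_score(q, k, q_pos, k_pos, freqs):
    """Dot product of the RoPE-rotated query and key."""
    rq = rope_rotate(q, q_pos, freqs)
    rk = rope_rotate(k, k_pos, freqs)
    return sum(a * b for a, b in zip(rq, rk))


def closed_form_score(q, k, delta, freqs):
    """Same score written as sum_i [A_i cos(phi_i d) - B_i sin(phi_i d)]."""
    total = 0.0
    for i, phi in enumerate(freqs):
        qx, qy, kx, ky = q[2 * i], q[2 * i + 1], k[2 * i], k[2 * i + 1]
        a = qx * kx + qy * ky
        b = qy * kx - qx * ky
        total += a * math.cos(phi * delta) - b * math.sin(phi * delta)
    return total


dim = 8
freqs = [10000 ** (-2 * i / dim) for i in range(dim // 2)]  # standard RoPE schedule
q = [0.3, -1.2, 0.8, 0.5, -0.7, 0.1, 1.1, -0.4]             # illustrative values
k = [1.0, 0.2, -0.6, 0.9, 0.4, -0.3, 0.5, 0.7]

# The score depends only on the offset q_pos - k_pos.
print(rope_score(q, k, 5, 2, freqs), rope_score(q, k, 13, 10, freqs))
print(closed_form_score(q, k, 3, freqs))
```

When the query and key coincide, every sine coefficient vanishes and the score reduces to $\sum_i a_i \cos(\phi_i \Delta t)$ with $a_i > 0$, which is the form analyzed in Proposition 1.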

Theorems & Definitions (2)

  • Proposition 1: Period and Amplitude of Harmonics
  • proof