AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration

Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao

TL;DR

Video diffusion transformers suffer from high computational cost due to long attention sequences. AsymRnR offers a training-free, model-agnostic solution by asymmetrically reducing attention tokens—manipulating $Q$ and $(K,V)$ independently—followed by restoration, and by adaptively scheduling reductions across blocks and timesteps. The approach is theoretically motivated via a KL-divergence perspective and practically enhanced with a matching cache to curb matching costs, achieving substantial speedups across multiple state-of-the-art DiTs with negligible or even positive effects on quality. This yields practical, generalizable acceleration suitable for real-time or near-real-time video generation without additional training or fine-tuning.

Abstract

Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video DiT sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule that adaptively distributes the reduction across components. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
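
The reduce-then-restore idea for attention can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the helper names (`select_kept`, `nearest_kept`, `asym_rnr_attention`), the stride-based choice of kept tokens, and the cosine-similarity matching are all assumptions made for illustration. What it shows is the asymmetry: $Q$ and $(K,V)$ are reduced with independent rates, and only the query side needs restoration, because the output length follows the number of queries.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Small epsilon guards against zero-norm vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-12)

def attention(Q, K, V):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(K[0]))
    out = []
    for q in Q:
        scores = [dot(q, k) * scale for k in K]
        m = max(scores)                       # subtract max for numerical stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out

def select_kept(n, stride):
    """Hypothetical selection rule: keep every `stride`-th token index."""
    return list(range(0, n, stride))

def nearest_kept(tokens, kept):
    """Map every token to the position (within `kept`) of its most similar kept token."""
    pos = {idx: c for c, idx in enumerate(kept)}
    assign = []
    for i, t in enumerate(tokens):
        if i in pos:                          # kept tokens map to themselves
            assign.append(pos[i])
        else:
            assign.append(max(range(len(kept)),
                              key=lambda c: cosine(t, tokens[kept[c]])))
    return assign

def asym_rnr_attention(Q, K, V, q_stride=2, kv_stride=1):
    """Reduce Q and (K, V) with independent rates, attend, then restore the Q side."""
    q_kept = select_kept(len(Q), q_stride)
    q_assign = nearest_kept(Q, q_kept)
    kv_kept = select_kept(len(K), kv_stride)  # K and V share indices
    out_reduced = attention([Q[i] for i in q_kept],
                            [K[i] for i in kv_kept],
                            [V[i] for i in kv_kept])
    # Restoration: each original position copies the output of its matched query.
    return [out_reduced[c] for c in q_assign]
```

With both strides set to 1 the function reduces to ordinary attention. When query tokens are exact duplicates of a kept token, reducing them changes nothing in the output, which is the redundancy the method exploits.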

Paper Structure

This paper contains 24 sections, 1 theorem, 11 equations, 13 figures, 11 tables.

Key Result

Corollary 3.1

Suppose $\{X_i\}_{i=1}^{l}$ and $\{X'_i\}_{i=1}^{l'}$ are covariance stationary sequences sampled from $\mathcal{P}$ and $\mathcal{P}'$, respectively. A Monte Carlo estimator is given by: where $\rho(i)$ is the nearest-neighbor (NN) Euclidean distance of $X'_i$ among $\{X'_j\}_{j\ne i}$ (this generally holds for $L_p$-distances, where $1 \le p \le \infty$) and $\nu(i)$ is the NN Euclidean distance

Figures (13)

  • Figure 1: Altering different components in video DiTs leads to varying degradation. Green blocks represent original attention blocks. Blue blocks represent attention blocks where 30% of the query tokens are randomly discarded, allowing only the remaining 70% to contribute to the output. Red blocks represent the same perturbation applied to key and value tokens. The comparison includes perturbing: (a) different features: $Q$ or $K\&V$; (b) different DiT blocks: shallow, medium, or deep; (c) different timesteps: early or later.
  • Figure 2: Overview of (a) symmetric and (b) asymmetric strategies. Both methods reduce the processing sequence length before self-attention to enhance efficiency and subsequently restore it to the original length for dense prediction. SymRnR performs reduction before mapping to $Q$, $K$, and $V$, whereas AsymRnR applies reduction afterward. This flexibility allows for the adaptive assignment of varying reduction rates to individual features. Moreover, AsymRnR supports operations on $Q$, $K$, and $V$ before the sequence is reduced, such as 3D rotary position embedding (RoPE), offering better compatibility. We use image patches for illustrative purposes.
  • Figure 3: CogVideoX attention feature similarity distribution. The shaded areas indicate the confidence interval. Blocks are divided into four groups, each exhibiting distinct trends, with variations observed across different feature types. These patterns remain consistent across generations with diverse contents.
  • Figure 4: Heatmap of matching similarity at different denoising timesteps. The similarities across successive timesteps are nearly identical, but divergence increases with a larger step gap.
  • Figure 5: Qualitative comparison on CogVideoX-2B [CogVideoX]. ToMe [ToMeSD] exhibits blurriness (left) and pixelation (right), whereas our AsymRnR consistently performs well. The video examples are provided in the Supplementary Materials.
  • ...and 8 more figures
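
The observation in Figure 4, that matching similarities are nearly identical across successive timesteps, suggests why a matching cache can help: a matching computed at one denoising step can be reused for a few following steps. The sketch below is a hedged illustration of that idea only. The class name `MatchingCache`, the `refresh_every` parameter, and the toy matching function are hypothetical and are not taken from the paper.

```python
class MatchingCache:
    """Reuse a token matching for several denoising steps before recomputing it."""

    def __init__(self, compute_matching, refresh_every=3):
        self.compute_matching = compute_matching  # expensive similarity matching
        self.refresh_every = refresh_every        # hypothetical refresh interval
        self._cached = None
        self._age = 0
        self.recomputations = 0

    def get(self, tokens):
        # Recompute on the first call and whenever the cached matching is too old.
        if self._cached is None or self._age >= self.refresh_every:
            self._cached = self.compute_matching(tokens)
            self._age = 0
            self.recomputations += 1
        self._age += 1
        return self._cached


def toy_matching(tokens):
    """Toy rule on scalar tokens: keep even indices, match each token to the closest kept one."""
    kept = list(range(0, len(tokens), 2))
    return [min(kept, key=lambda k: abs(tokens[i] - tokens[k]))
            for i in range(len(tokens))]


cache = MatchingCache(toy_matching, refresh_every=3)
tokens = [0.0, 0.1, 1.0, 0.9]
for step in range(7):
    matching = cache.get(tokens)  # matching is recomputed at steps 0, 3, and 6 only
```

In a real pipeline one such cache would plausibly be kept per attention block and per feature type, since the abstract notes that redundancy varies across both. That design is an assumption here, and a larger refresh interval trades matching accuracy for speed because similarities diverge as the step gap grows.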

Theorems & Definitions (1)

  • Corollary 3.1