AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration
Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, Dacheng Tao
TL;DR
Video diffusion transformers are expensive to sample because attention runs over very long token sequences. AsymRnR is a training-free, model-agnostic method that reduces attention tokens asymmetrically, treating $Q$ and $(K,V)$ independently, and then restores the sequence; it also schedules the amount of reduction adaptively across blocks and timesteps. The approach is motivated theoretically from a KL-divergence perspective, and a matching cache keeps the cost of token matching low. It achieves substantial speedups across multiple state-of-the-art video DiTs with negligible loss in quality and, in some cases, an improvement. This makes it a practical, general way to move video generation closer to real time without additional training or fine-tuning.
Abstract
Diffusion Transformers (DiTs) have proven effective in generating high-quality videos but are hindered by high computational costs. Existing video DiT sampling acceleration methods often rely on costly fine-tuning or exhibit limited generalization capabilities. We propose Asymmetric Reduction and Restoration (AsymRnR), a training-free and model-agnostic method to accelerate video DiTs. It builds on the observation that redundancies of feature tokens in DiTs vary significantly across different model blocks, denoising steps, and feature types. AsymRnR asymmetrically reduces redundant tokens in the attention operation, achieving acceleration with negligible degradation in output quality and, in some cases, even improving it. We also tailor a reduction schedule to distribute the reduction across components adaptively. To further accelerate this process, we introduce a matching cache for more efficient reduction. Backed by theoretical foundations and extensive experimental validation, AsymRnR integrates into state-of-the-art video DiTs and offers substantial speedup.
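
To make the idea of reducing $Q$ and $(K,V)$ independently more concrete, here is a minimal NumPy sketch of a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the strided split into reference and source tokens, the cosine-similarity matching, matching $(K,V)$ on $K$ features, and the names `match_tokens`, `reduce_tokens`, and `asym_attention` are all hypothetical choices made for this example. The adaptive schedule and the matching cache are not modeled.

```python
import numpy as np

def match_tokens(x, stride=2):
    """Split tokens into reference (always kept) and source (removable) sets,
    then match every source token to its most similar reference token."""
    n = x.shape[0]
    ref_idx = np.arange(0, n, stride)                  # assumed strided split
    src_idx = np.setdiff1d(np.arange(n), ref_idx)
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = xn[src_idx] @ xn[ref_idx].T                  # cosine similarity, (n_src, n_ref)
    return ref_idx, src_idx, sim.argmax(axis=1), sim.max(axis=1)

def reduce_tokens(x, ratio, stride=2):
    """Drop roughly `ratio` of the tokens, choosing the source tokens that are
    most similar to a reference token. Returns the kept indices and a map
    `restore`, where restore[i] is the row of the reduced sequence standing in
    for original token i."""
    n = x.shape[0]
    ref_idx, src_idx, best_ref, best_sim = match_tokens(x, stride)
    n_drop = min(int(ratio * n), len(src_idx))         # only sources can be dropped
    order = np.argsort(-best_sim)                      # most redundant sources first
    drop, keep_src = order[:n_drop], np.sort(order[n_drop:])
    kept = np.concatenate([ref_idx, src_idx[keep_src]])
    restore = np.empty(n, dtype=int)
    restore[kept] = np.arange(len(kept))
    # References occupy the first rows of `kept`, so best_ref indexes them directly.
    restore[src_idx[drop]] = best_ref[drop]
    return kept, restore

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def asym_attention(q, k, v, q_ratio, kv_ratio):
    """Attention with separate reduction ratios for Q and for (K, V)."""
    q_kept, q_restore = reduce_tokens(q, q_ratio)
    kv_kept, _ = reduce_tokens(k, kv_ratio)            # assumption: match (K, V) on K
    out_small = attention(q[q_kept], k[kv_kept], v[kv_kept])
    # Only the Q reduction shortens the output, so only it needs restoration:
    # each dropped query copies the output of its matched reference query.
    return out_small[q_restore]
```

In this sketch, reducing $(K,V)$ shortens the attended sequence but leaves the output length unchanged, whereas reducing $Q$ shortens the output and therefore requires the restoration step. With both ratios set to zero, the function reproduces standard attention up to floating-point error, which is a convenient sanity check before trying larger ratios.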
