Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian, Denis Korzhenkov, Amirhossein Habibian

TL;DR

This paper tackles the prohibitive quadratic cost of self-attention in transformer-based video diffusion models by introducing Attention Surgery, a framework to retrofit pretrained diffusion transformers (DiTs) with linear or hybrid attention without training from scratch. It combines a novel hybrid attention mechanism with learnable feature maps, attention distillation, and a block-rate optimization strategy that allocates attention styles per transformer block under a global compute budget, followed by lightweight fine-tuning. The method achieves competitive performance on Wan2.1 1.3B with substantial on-device efficiency gains, validated by VBench benchmarks and a human study, while requiring modest compute (fewer than 0.4k GPU-hours for the surgery). These results suggest that efficient video diffusion can be deployed in practice with near-state-of-the-art quality, and point to future work integrating causality to enable RNN-like diffusion at scale.

Abstract

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0, and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. The project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
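
To make the contrast between softmax, linear, and hybrid attention concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the fixed `feature_map` (elu + 1) stands in for the paper's learnable feature maps, and `hybrid_attention` assumes that every `rate`-th key/value token is treated with softmax weights while the remaining tokens use linear (kernel) weights under a shared normalizer. The function names and the exact way the two parts are combined are hypothetical.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a common positive feature map, used here as an assumed
    # stand-in for the learnable feature maps described in the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax_attention(q, k, v):
    # Standard attention: builds an (n, n) score matrix, so cost is quadratic in n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def linear_attention(q, k, v):
    # Kernelized attention: associativity lets us form phi(K)^T V first,
    # a (d, d_v) summary, so cost is linear in the number of tokens n.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                 # (d, d_v)
    z = phi_k.sum(axis=0)            # (d,)
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def hybrid_attention(q, k, v, rate):
    # Hypothetical hybrid: every `rate`-th token contributes softmax-style
    # weights, the rest contribute linear weights; one shared normalizer.
    # No max-subtraction is applied, so this sketch assumes moderate scores.
    n, d = k.shape
    is_softmax = np.arange(n) % rate == 0
    ks, vs = k[is_softmax], v[is_softmax]
    kl, vl = k[~is_softmax], v[~is_softmax]
    w = np.exp(q @ ks.T / np.sqrt(d))          # (n, n_softmax)
    phi_q, phi_k = feature_map(q), feature_map(kl)
    num = w @ vs + phi_q @ (phi_k.T @ vl)
    den = w.sum(axis=-1) + phi_q @ phi_k.sum(axis=0)
    return num / den[:, None]

rng = np.random.default_rng(0)
q, k, v = (0.5 * rng.normal(size=(12, 4)) for _ in range(3))
out = hybrid_attention(q, k, v, rate=3)       # rate 3, as in the Figure 2 example
```

In this sketch, `rate=1` reduces to plain softmax attention, and larger rates shrink the quadratic part to roughly `n * n / rate` score entries while the remaining tokens are summarized in a fixed-size `(d, d_v)` state.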

Paper Structure

This paper contains 15 sections, 9 equations, 25 figures, 9 tables, 1 algorithm.

Figures (25)

  • Figure 1: Left: Impact of the proposed method components: attention distillation and hybrid attention. The linear/hybrid models are obtained within fewer than 0.4k GPU-hours. Prompt: "An astronaut flying in space, Van Gogh style.". Right: Compute growth comparison between Wan2.1 1.3B flash attention blocks and those of Attention Surgery, in terms of FLOPs (top) and Snapdragon8-Gen4 mobile latency (bottom).
  • Figure 2: Overview of the attention distillation for the proposed attention surgery method. The example illustrates token separation with a hybridization rate of 3.
  • Figure 3: Per-block distillation error (top-left) and compute implications of $\phi$ architectural parameters and the attention hybrid rate.
  • Figure 4: Sample qualitative video frames from hybrid models with varying numbers of hybrid blocks (15, 20, 25) and hybrid rates (2, 4, 8). For each configuration, the left frame shows the result after layer-wise attention distillation, and the right frame shows the result after 1,000 fine-tuning iterations. Prompt: "A man is reading a book sitting on the cloud."
  • Figure 5: Total DiT FLOPs percentage versus VBench score for the original Wan2.1 1.3B model compared to various hybrid configurations, at 320$\times$480 (left) and 480$\times$832 (right) resolutions.
  • ...and 20 more figures