Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Mohsen Ghafoorian; Denis Korzhenkov; Amirhossein Habibian

TL;DR

This paper tackles the prohibitive quadratic cost of self-attention in transformer-based video diffusion models by introducing Attention Surgery, a framework to retrofit pretrained DiTs with linear or hybrid attention without training from scratch. It combines a novel hybrid attention mechanism with learnable feature maps, attention distillation, and a block-rate optimization strategy to allocate attention styles per transformer block under a global compute budget, followed by lightweight fine-tuning. The method achieves competitive performance on Wan2.1 1.3B with substantial on-device efficiency gains, validated by VBench benchmarks and a human study, while requiring modest compute (less than 0.4k GPU-hours for the surgery). These results suggest practical deployments of efficient video diffusion with near-state-of-the-art quality, and point to future work integrating causality to enable RNN-like diffusion at scale.

Abstract

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, previous approaches have failed to match the expressiveness of softmax attention unless retrained at significant computational cost. We introduce Attention Surgery, an efficient framework that enables linear or hybrid attention in pretrained VDMs, eliminating the need for training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art efficient transformer VDM, and evaluated on VBench, VBench2.0, and a human preference study, Attention Surgery achieves competitive results. Furthermore, measurements of on-mobile latency, memory usage, and FLOPs demonstrate notable improvements in scaling behavior for longer videos. The project page is available at: https://qualcomm-ai-research.github.io/attention-surgery.
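To make the complexity contrast in the abstract concrete, the following is a minimal NumPy sketch of generic softmax attention next to generic kernelized linear attention. It is not the paper's hybrid mechanism or its learnable feature maps: the fixed `elu(x) + 1` feature map, the single-head unbatched shapes, and all function names are assumptions chosen for illustration only.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Materializes the full (N, N) score matrix, so time and memory
    # grow quadratically with the number of tokens N.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def feature_map(x):
    # elu(x) + 1: a fixed, strictly positive feature map (an assumption here;
    # the paper describes learnable feature maps instead).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    # Reorders the product as phi(Q) @ (phi(K)^T @ V), so no (N, N)
    # matrix is ever formed and the cost is linear in N.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                 # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)            # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def linear_attention_reference(q, k, v):
    # Same quantity computed the quadratic way, used only to check
    # that the reordering above does not change the result.
    phi_q, phi_k = feature_map(q), feature_map(k)
    a = phi_q @ phi_k.T              # (N, N)
    return (a / a.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
assert np.allclose(linear_attention(q, k, v), linear_attention_reference(q, k, v))
```

The final assertion checks only that the linear-time ordering reproduces the kernelized attention output exactly. It does not claim that linear attention matches softmax attention, which is the expressiveness gap the abstract says distillation and fine-tuning are meant to close.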
