SVG-EAR: Parameter-Free Linear Compensation for Sparse Video Generation via Error-aware Routing

Xuanyi Zhou, Qiuyang Mang, Shuo Yang, Haocheng Xi, Jintao Zhang, Huanzhi Mao, Joseph E. Gonzalez, Kurt Keutzer, Ion Stoica, Alvin Cheung

TL;DR

This paper introduces SVG-EAR, a parameter-free linear compensation branch that uses cluster centroids to approximate skipped attention blocks and recover their contributions. It provides theoretical guarantees that relate attention reconstruction error to clustering quality, and shows empirically that SVG-EAR improves the quality-efficiency trade-off, increasing throughput at the same generation fidelity on video diffusion tasks.

Abstract

Diffusion Transformers (DiTs) have become a leading backbone for video generation, yet their quadratic attention cost remains a major bottleneck. Sparse attention reduces this cost by computing only a subset of attention blocks. However, prior methods often either drop the remaining blocks, which incurs information loss, or rely on learned predictors to approximate them, introducing training overhead and potential output distribution shifting. In this paper, we show that the missing contributions can be recovered without training: after semantic clustering, keys and values within each block exhibit strong similarity and can be well summarized by a small set of cluster centroids. Based on this observation, we introduce SVG-EAR, a parameter-free linear compensation branch that uses the centroid to approximate skipped blocks and recover their contributions. While centroid compensation is accurate for most blocks, it can fail on a small subset. Standard sparsification typically selects blocks by attention scores, which indicate where the model places its attention mass, but not where the approximation error would be largest. SVG-EAR therefore performs error-aware routing: a lightweight probe estimates the compensation error for each block, and we compute exactly the blocks with the highest error-to-cost ratio while compensating for skipped blocks. We provide theoretical guarantees that relate attention reconstruction error to clustering quality, and empirically show that SVG-EAR improves the quality-efficiency trade-off and increases throughput at the same generation fidelity on video diffusion tasks. Overall, SVG-EAR establishes a clear Pareto frontier over prior approaches, achieving up to 1.77$\times$ and 1.93$\times$ speedups while maintaining PSNRs of up to 29.759 and 31.043 on Wan2.2 and HunyuanVideo, respectively.
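
As a rough illustration of the centroid idea described in the abstract, the sketch below compares the exact unnormalized contribution of one key/value block to a query with an approximation that replaces every key and value in the block by the block mean. The function names (`exact_block`, `centroid_block`), the toy data, and the absence of any logit scaling are assumptions made for illustration; the paper's actual formulation may differ.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    """Component-wise mean (centroid) of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def exact_block(q, keys, values):
    """Exact unnormalized contribution of one block to query q.

    Returns (numerator, denominator), where
    numerator = sum_j exp(q.k_j) * v_j and denominator = sum_j exp(q.k_j).
    """
    weights = [math.exp(dot(q, k)) for k in keys]
    dim = len(values[0])
    numerator = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return numerator, sum(weights)

def centroid_block(q, keys, values):
    """Assumed compensation: every key and value is replaced by the block centroid."""
    k_bar, v_bar = mean_vec(keys), mean_vec(values)
    w = math.exp(dot(q, k_bar))  # one shared weight for the whole block
    n = len(keys)
    return [n * w * x for x in v_bar], n * w

# Toy block whose keys and values are nearly identical, as expected after clustering.
q = [0.5, -0.2]
keys = [[1.0, 0.0], [1.01, 0.02], [0.99, -0.01]]
values = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
num_exact, den_exact = exact_block(q, keys, values)
num_approx, den_approx = centroid_block(q, keys, values)
print(num_exact, den_exact)
print(num_approx, den_approx)
```

In this sketch the approximation needs one exponential per block instead of one per key, and it becomes exact when all keys in the block coincide. Blocks with widely spread keys are where it would be expected to fail, which is the case error-aware routing targets.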

Paper Structure

This paper contains 37 sections, 2 theorems, 18 equations, 10 figures, 4 tables, 2 algorithms.

Key Result

Proposition 1

Let $\delta_q^2 = \frac{1}{N_q} \sum_i \| q_i - \bar{q}_i \|_2^2$ denote the average squared $\ell_2$ error between each query and its cluster mean, and let $K_{\max}$ denote the maximum $\ell_2$ norm of the key tokens. Given any mask $M$ that does not significantly perturb the attention normalizers , where $Z_i$ denotes the softmax normalizer for query $i$.
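
The two quantities defined in Proposition 1 can be computed directly from their definitions. The sketch below is a minimal illustration; the helper names (`cluster_means`, `delta_q_squared`, `k_max`) and the representation of cluster assignments as a list of integer labels are assumptions, not part of the paper.

```python
import math

def cluster_means(points, labels):
    """Mean vector of each cluster, keyed by cluster label."""
    sums, counts = {}, {}
    for p, c in zip(points, labels):
        if c not in sums:
            sums[c] = [0.0] * len(p)
            counts[c] = 0
        sums[c] = [s + x for s, x in zip(sums[c], p)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def delta_q_squared(queries, labels):
    """Average squared l2 distance between each query and its cluster mean."""
    means = cluster_means(queries, labels)
    total = 0.0
    for q, c in zip(queries, labels):
        total += sum((x - m) ** 2 for x, m in zip(q, means[c]))
    return total / len(queries)

def k_max(keys):
    """Maximum l2 norm over all key tokens."""
    return max(math.sqrt(sum(x * x for x in k)) for k in keys)

# Hypothetical toy data: three queries in two clusters, two keys.
queries = [[0.0, 0.0], [2.0, 0.0], [5.0, 5.0]]
labels = [0, 0, 1]
keys = [[3.0, 4.0], [1.0, 0.0]]
print(delta_q_squared(queries, labels), k_max(keys))
```

A tighter clustering lowers `delta_q_squared`, which is the sense in which the proposition ties reconstruction error to clustering quality.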

Figures (10)

  • Figure 1: SVG-EAR significantly accelerates video generation for Wan2.2 and HunyuanVideo. On a single NVIDIA H100 GPU, it achieves $1.81\times$ and $1.93\times$ speedups at PSNRs of $26$ and $30$, respectively.
  • Figure 2: Existing methods and top-$p$ selection fall short: dropping "low-score" blocks and error-unaware block selection degrade the error-density trade-off. (a) Original attention map. (b) Permuted map after semantic-aware clustering. (c) Ignoring low-score blocks causes a large, sparse attention error. (d) Linear compensation with cluster means still yields high error due to naive top-$p$ selection. (e) Our method improves both error and density by routing based on the gap between full computation and compensation.
  • Figure 3: Overview of SVG-EAR. (a) The attention map after semantic-aware clustering. (b) Block error estimation. Using the cluster mean as a proxy for individual queries, the block error is computed via the sum of squared differences between exponentiated logits of individual keys $k_j$ and the key mean $\bar{k}$, then normalized by the block area to determine the error-to-size ratio. (c) Blocks with the highest error-to-size ratio are greedily selected for exact attention within the budget, while the rest are assigned to linear compensation.
  • Figure 4: Visualization of attention maps from specific attention heads in Wan2.2 during text-to-video generation. (a) Original attention maps with sparse patterns, using heatmaps to represent attention weights. (b) Permuted attention maps. Following SVG2, we permute the attention maps such that attention begins to cluster into specific blocks. (c) Applying SVG2's top-$p$ selection to these permuted attention maps, which ignores the sparse attention mechanisms of the remaining blocks. (d) Applying our Error-aware Routing and Mean Compensation mechanisms to the permuted attention maps, achieving higher attention map similarity.
  • Figure 5: Efficiency evaluation. (a) illustrates the latency breakdown of different components during a single complete inference process of Wan2.2 T2V 720p, compared with the vanilla implementation and SVG2. (b) presents an end-to-end latency comparison between our efficient Triton implementation and the native PyTorch version.
  • ...and 5 more figures
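
The routing procedure summarized in the Figure 3 caption can be sketched as follows. This is a minimal reading of the caption, not the authors' implementation: the names `block_error` and `route_blocks`, the tuple layout for a block, the definition of block area as (number of queries) × (number of keys), and the choice to skip a block that does not fit the remaining budget are all assumptions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def block_error(q_bar, keys):
    """Sum of squared differences between exp(q_bar.k_j) and exp(q_bar.k_bar),
    using the query cluster mean q_bar as a proxy for the individual queries."""
    k_bar = [sum(col) / len(keys) for col in zip(*keys)]
    ref = math.exp(dot(q_bar, k_bar))
    return sum((math.exp(dot(q_bar, k)) - ref) ** 2 for k in keys)

def route_blocks(blocks, budget):
    """Greedily pick blocks for exact attention by error-to-size ratio.

    blocks: list of (name, q_bar, n_queries, keys) tuples (hypothetical layout).
    budget: maximum total block area that may be computed exactly.
    Returns (exact, compensated) lists of block names.
    """
    scored = []
    for name, q_bar, n_queries, keys in blocks:
        area = n_queries * len(keys)  # assumed definition of block area
        scored.append((block_error(q_bar, keys) / area, area, name))
    scored.sort(key=lambda t: t[0], reverse=True)  # highest ratio first

    exact, compensated, used = [], [], 0
    for ratio, area, name in scored:
        if used + area <= budget:
            exact.append(name)
            used += area
        else:
            compensated.append(name)  # handled by centroid compensation
    return exact, compensated

# Block "A" has identical keys (zero estimated error); block "B" has spread keys.
blocks = [
    ("A", [1.0, 0.0], 2, [[0.5, 0.0], [0.5, 0.0]]),
    ("B", [1.0, 0.0], 2, [[2.0, 0.0], [-2.0, 0.0]]),
]
print(route_blocks(blocks, budget=4))
```

With a budget that fits only one block, the sketch sends the high-error block to exact attention and leaves the well-clustered block to compensation. A score-based top-$p$ rule could choose differently, since it ranks blocks by attention mass rather than by approximation error.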

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof