Re-ttention: Ultra Sparse Visual Generation via Attention Statistical Reshape

Ruichen Chen, Keith G. Mills, Liyao Jiang, Chao Gao, Di Niu

TL;DR

This work targets the heavy computational burden of global self-attention in Diffusion Transformer models used for text-to-video and text-to-image generation. It introduces Re-ttention, a training-free ultra-sparse attention approach that reconstructs full-attention behavior by reusing the softmax denominator statistics from previous denoising steps and applying residual caching to correct normalization. Empirically, Re-ttention achieves extremely high sparsity (up to 96.9%) with quality on par with or exceeding strong sparse baselines across CogVideoX and PixArt DiTs, on both VBench-based video metrics and GenEval/HPSv2/COCO image benchmarks. The method offers a practical, low-overhead route to scalable diffusion-based generation and outlines avenues for dynamic masking and extension to autoregressive visual models.

Abstract

Diffusion Transformers (DiT) have become the de-facto model for generating high-quality visual content like videos and images. A major bottleneck is the attention mechanism, whose complexity scales quadratically with resolution and video length. One logical way to lessen this burden is sparse attention, where only a subset of tokens or patches is included in the calculation. However, existing techniques fail to preserve visual quality at extremely high sparsity levels and may even incur non-negligible compute overheads. To address this concern, we propose Re-ttention, which implements very high-sparsity attention for visual generation models by leveraging the temporal redundancy of Diffusion Models to overcome the probabilistic normalization shift within the attention mechanism. Specifically, Re-ttention reshapes attention scores based on the prior softmax distribution history in order to preserve the visual quality of the full quadratic attention at very high sparsity levels. Experimental results on T2V/T2I models such as CogVideoX and the PixArt DiTs demonstrate that Re-ttention requires as few as 3.1% of the tokens during inference, outperforming contemporary methods like DiTFastAttn, Sparse VideoGen and MInference.
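The normalization shift mentioned above can be illustrated with a small numerical sketch. This is not the authors' implementation. It is a pure-Python toy under stated assumptions: a single query row, a fixed window that keeps 2 of 64 keys (about 3.1% of the tokens), $\rho$ taken to be the sparse softmax denominator divided by the full one, and scores that drift only slightly between two denoising steps. All function and variable names are hypothetical, and the residual-caching step is omitted.

```python
import math
import random


def softmax_denominator(scores, keep):
    """Sum of exp(score - max) over the kept key indices."""
    shift = max(scores)  # same shift for every subset, so ratios stay comparable
    return sum(math.exp(scores[j] - shift) for j in keep)


def attention_row(scores, keep):
    """Softmax restricted to the kept keys; masked keys get weight 0."""
    shift = max(scores)
    kept = set(keep)
    denom = sum(math.exp(scores[j] - shift) for j in kept)
    return [math.exp(s - shift) / denom if j in kept else 0.0
            for j, s in enumerate(scores)]


random.seed(0)
n_keys = 64
all_keys = list(range(n_keys))
window = [0, 1]  # hypothetical mask: keep 2 of 64 keys (96.875% sparsity)

# Previous denoising step: assume full attention was available here,
# so the denominator ratio rho can be measured and cached.
scores_prev = [random.gauss(0.0, 1.0) for _ in range(n_keys)]
rho_cached = (softmax_denominator(scores_prev, window)
              / softmax_denominator(scores_prev, all_keys))

# Current step: scores drift slightly (assumed temporal redundancy).
scores_now = [s + random.gauss(0.0, 0.05) for s in scores_prev]

full_now = attention_row(scores_now, all_keys)   # reference, normally too costly
sparse_now = attention_row(scores_now, window)   # inflated: the row sums to 1
rescaled = [rho_cached * a for a in sparse_now]  # scaled back with the cached ratio

# Worst-case error on the kept keys, before and after rescaling.
err_sparse = max(abs(sparse_now[j] - full_now[j]) for j in window)
err_rescaled = max(abs(rescaled[j] - full_now[j]) for j in window)

print(f"rho_cached   = {rho_cached:.4f}")
print(f"err_sparse   = {err_sparse:.4f}")
print(f"err_rescaled = {err_rescaled:.4f}")
```

In this toy setting the sparse row is renormalized over only two keys, so each kept weight is far larger than under full attention. Multiplying by the cached ratio brings the kept weights close to their full-attention values, with the remaining gap coming from how much the ratio drifts between steps.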

Paper Structure

This paper contains 24 sections, 13 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Visual comparison using the CogVideoX-2B (Yang et al., 2024) T2V model. Columns correspond to different frames. Rows correspond to different sparse attention methods (sparsity degree in parentheses; higher is better). Prompt: "a colorful butterfly perching on a bud". More examples in the Appendix.
  • Figure 2: Illustration of the attention map $A$ computed by full attention, contemporary sparse attention (window-based) and our proposed Re-ttention. Sparse attention shifts the distribution of attention scores, resulting in degraded performance as sparsity increases. In contrast, Re-ttention re-uses the denominator ratio cached from the previous denoising steps to scale the sparse attention score to the full attention level. Then, we apply residual caching to accurately restore the full attention scores.
  • Figure 3: Quality-sparsity comparison of Re-ttention, Sparse VideoGen (SVG), MInference and DiTFastAttn. $\bigstar$ denotes the sparsity level at which prior methods operate without quality degradation.
  • Figure 4: Visual comparison of pre-Softmax and post-Softmax masking on CogVideoX-2B with 66% sparsity, using sliding-window attention (Beltagy et al., 2020).
  • Figure 5: Softmax denominators for full and sparse attention, as well as their ratio $\rho$ per Eq. \ref{eq:softmax_rho}, plotted across 20 steps.
  • ...and 9 more figures