Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers

Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, Liang-Chieh Chen

TL;DR

This paper tackles the high computational cost of self-attention in diffusion-based Transformers by exploiting the local sparsity of pretrained attention maps. It introduces GRAT, a training-free two-stage method that groups tokens and restricts attention to structured regions, offering GRAT-B (surrounding blocks) and GRAT-X (criss-cross) variants. The approach delivers substantial real-world speedups (e.g., up to 35.8× on 8192×8192 images and 15.8× on video) while maintaining generation quality on Flux and HunyuanVideo, markedly reducing latency without fine-tuning. The results demonstrate GRAT's potential to enable practical, high-resolution diffusion-based generation on resource-constrained hardware and motivate further research in grouping-based sparse attention for visual synthesis.

Abstract

Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment; for example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In this work, we propose GRAT (GRouping first, ATtending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity of the attention maps learned by pretrained Diffusion Transformers (which tend to be locally focused) and to make better use of GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning both with GPU execution patterns and with the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a $35.8\times$ speedup over full attention when generating $8192\times 8192$ images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
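
The two stages described in the abstract (group first, then restrict each group's shared keys and values to a structured region) can be illustrated with a small group-level mask. The following is a minimal sketch, not the authors' implementation: the function names, the square `group` size, and the `radius` parameter (how many neighboring groups count as "surrounding blocks") are assumptions introduced here for illustration.

```python
from typing import List, Tuple


def token_to_group(row: int, col: int, group: int, groups_per_row: int) -> int:
    """Map a token at (row, col) to the flat index of its non-overlapping group."""
    return (row // group) * groups_per_row + (col // group)


def group_mask(grid_h: int, grid_w: int, variant: str, radius: int = 1) -> List[List[bool]]:
    """Build a group-level mask over a grid_h x grid_w grid of groups.

    mask[q][k] is True when every query token in group q may attend to every
    key/value token in group k, so all queries in a group share one key set.

    variant "B": key groups within `radius` groups of the query group
                 (surrounding blocks; `radius` is an assumed parameter).
    variant "X": key groups in the same group-row or group-column (criss-cross).
    """
    if variant not in ("B", "X"):
        raise ValueError("variant must be 'B' or 'X'")
    n = grid_h * grid_w
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        qr, qc = divmod(q, grid_w)
        for k in range(n):
            kr, kc = divmod(k, grid_w)
            if variant == "B":
                mask[q][k] = abs(qr - kr) <= radius and abs(qc - kc) <= radius
            else:
                mask[q][k] = qr == kr or qc == kc
    return mask


def can_attend(q_pos: Tuple[int, int], k_pos: Tuple[int, int],
               mask: List[List[bool]], group: int, groups_per_row: int) -> bool:
    """Token-level lookup: a query may attend to a key iff their groups are linked."""
    q = token_to_group(q_pos[0], q_pos[1], group, groups_per_row)
    k = token_to_group(k_pos[0], k_pos[1], group, groups_per_row)
    return mask[q][k]


# Example: an 8x8 token map with 2x2 groups gives a 4x4 grid of groups.
mask_b = group_mask(4, 4, "B")
mask_x = group_mask(4, 4, "X")
```

Because the mask is defined per group rather than per token, whole blocks of the attention matrix are either computed or skipped, which is the property the abstract ties to GPU execution patterns. A real kernel would consume such a mask in a block-sparse attention routine; that part is outside this sketch.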

Paper Structure

This paper contains 16 sections, 7 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Fast high-resolution image and video generation enabled by equipping Flux and HunyuanVideo with the proposed GRAT, a training-free attention acceleration strategy. GRAT significantly improves inference speed without any fine-tuning or degradation in visual quality.
  • Figure 2: Attention Visualization of Flux. Left: A single query token (marked in red) attends only to sparse, local regions. Middle: Visualization of attention maps across all query tokens in an image (each row corresponds to a query). The patterns consistently demonstrate sparsity and locality. Right: Distribution of attention scores, averaged over 100 generated images, plotted as a function of the normalized spatial distance between query and key tokens (1.0 denotes the maximum distance). The attention score (defined as the scaled dot product between query and key tokens after softmax normalization) reflects the contribution of each query-key pair to the final output. The distribution is long-tailed, indicating that spatially close key tokens contribute most to the attention output.
  • Figure 3: Comparison of Attention Schemes. The comparison is based on Flux with various attention mechanisms, including Full Attention, Neighborhood Attention (NA), and the proposed GRAT-B and GRAT-X. FLOPs Sparsity measures the theoretical reduction in compute relative to Full Attention (0% indicates no reduction). Inference Speedup reflects real-world speedup on an A100 GPU, relative to Full Attention (1$\times$ means no speedup). Farthest Token Distance denotes the maximum distance over which a query can attend, representing the effective receptive field. As shown, GRAT-B achieves the same FLOPs sparsity as NA but delivers higher inference speedup. Conversely, GRAT-X maintains comparable speedup while offering a much larger receptive field. Both variants outperform NA in GenEval scores, with GRAT-X notably matching the quality of Full Attention while running 12$\times$ faster.
  • Figure 4: Illustration of Attention Operations. Query tokens are shown in red, and their corresponding attended regions (key and value tokens) are highlighted in light blue. (a) Full Attention: each query attends to the entire feature map. (b) Neighborhood Attention: each query attends only to its local spatial neighborhood. (c) Criss-cross Attention: each query attends to tokens in the same row and column. (d–f) In this work, we propose GRAT (GRouping first, ATtending smartly), which first partitions the feature map into non-overlapping groups (d) (each group has $2 \times 2$ tokens in this example). Query tokens within the same group share a common set of key and value tokens, which are restricted to structured regions, such as surrounding blocks (e) or criss-cross patterns (f).
  • Figure 5: Generated Images by Flux with Different Attention Mechanisms. We compare the visual results of Full Attention (i.e., the original Flux), Neighborhood Attention (NA), CLEAR, and our proposed GRAT-B and GRAT-X.
  • ...and 4 more figures
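
Two of the quantities named in the Figure 3 caption, FLOPs Sparsity and Farthest Token Distance, can be made concrete on a toy feature map. The sketch below is an illustration under stated assumptions, not the paper's accounting: it counts allowed query-key pairs by brute force as a proxy for attention FLOPs, measures distance as Euclidean distance between token coordinates normalized by the map diagonal, and uses a hypothetical `radius` parameter for the surrounding-block pattern. The function names are likewise invented here.

```python
import math
from typing import Tuple


def allowed(qr: int, qc: int, kr: int, kc: int,
            group: int, variant: str, radius: int = 1) -> bool:
    """Return True if the query token (qr, qc) may attend to the key token (kr, kc)."""
    gqr, gqc, gkr, gkc = qr // group, qc // group, kr // group, kc // group
    if variant == "full":
        return True
    if variant == "B":  # surrounding blocks, in units of groups
        return abs(gqr - gkr) <= radius and abs(gqc - gkc) <= radius
    if variant == "X":  # criss-cross over group rows and group columns
        return gqr == gkr or gqc == gkc
    raise ValueError("variant must be 'full', 'B', or 'X'")


def pattern_stats(h: int, w: int, group: int, variant: str,
                  radius: int = 1) -> Tuple[float, float]:
    """Brute-force (sparsity, farthest_distance) for an h x w token map.

    sparsity: 1 - allowed_pairs / all_pairs (0.0 means no reduction).
    farthest_distance: largest query-key distance that is still allowed,
    divided by the map diagonal so that 1.0 is the maximum possible.
    """
    pairs = 0
    farthest = 0.0
    for qr in range(h):
        for qc in range(w):
            for kr in range(h):
                for kc in range(w):
                    if allowed(qr, qc, kr, kc, group, variant, radius):
                        pairs += 1
                        farthest = max(farthest, math.hypot(qr - kr, qc - kc))
    total = (h * w) ** 2
    return 1.0 - pairs / total, farthest / math.hypot(h - 1, w - 1)


# Toy comparison on an 8x8 token map with 2x2 groups.
for name in ("full", "B", "X"):
    sparsity, reach = pattern_stats(8, 8, 2, name)
    print(f"{name:>4}: sparsity={sparsity:.3f}, farthest={reach:.3f}")
```

On this toy map the criss-cross pattern reaches farther than the surrounding-block pattern, which mirrors the qualitative trade-off in the caption. The absolute numbers depend on the map size, group size, and radius, and should not be read as the paper's reported values.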