Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers
Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, Liang-Chieh Chen
TL;DR
This paper tackles the high computational cost of self-attention in diffusion-based Transformers by exploiting the local sparsity of pretrained attention maps. It introduces GRAT, a training-free two-stage method that groups tokens and restricts attention to structured regions, offering GRAT-B (surrounding blocks) and GRAT-X (criss-cross) variants. The approach delivers substantial real-world speedups (e.g., up to 35.8× on 8192×8192 images and 15.8× on video) while maintaining generation quality on Flux and HunyuanVideo, markedly reducing latency without fine-tuning. The results demonstrate GRAT's potential to enable practical, high-resolution diffusion-based generation on resource-constrained hardware and motivate further research in grouping-based sparse attention for visual synthesis.
Abstract
Diffusion-based Transformers have demonstrated impressive generative capabilities, but their high computational costs hinder practical deployment. For example, generating an $8192\times 8192$ image can take over an hour on an A100 GPU. In this work, we propose GRAT (\textbf{GR}ouping first, \textbf{AT}tending smartly), a training-free attention acceleration strategy for fast image and video generation without compromising output quality. The key insight is to exploit the inherent sparsity of the attention maps learned by pretrained Diffusion Transformers (which tend to be locally focused) and to make better use of GPU parallelism. Specifically, GRAT first partitions contiguous tokens into non-overlapping groups, aligning with both GPU execution patterns and the local attention structures learned in pretrained generative Transformers. It then accelerates attention by having all query tokens within the same group share a common set of attendable key and value tokens. These key and value tokens are further restricted to structured regions, such as surrounding blocks or criss-cross regions, significantly reducing computational overhead (e.g., attaining a \textbf{35.8$\times$} speedup over full attention when generating $8192\times 8192$ images) while preserving essential attention patterns and long-range context. We validate GRAT on pretrained Flux and HunyuanVideo for image and video generation, respectively. In both cases, GRAT achieves substantially faster inference without any fine-tuning, while maintaining the performance of full attention. We hope GRAT will inspire future research on accelerating Diffusion Transformers for scalable visual generation.
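The two-stage idea described in the abstract (group tokens, then let every query in a group share one structured set of attendable keys) can be illustrated with a small group-level mask. The following is a minimal sketch and not the authors' implementation: the function name `group_mask`, the square group shape, the `radius` parameter, and the exact definitions of the "block" and "cross" regions are assumptions made for illustration only.

```python
import numpy as np

def group_mask(grid_h, grid_w, group, mode="block", radius=1):
    """Group-level attention mask (hypothetical helper, not from the paper).

    Tokens are assumed to lie on a grid_h x grid_w grid and to be partitioned
    into non-overlapping group x group tiles. mask[q, k] is True when every
    query token in group q may attend to every key token in group k.
    """
    assert grid_h % group == 0 and grid_w % group == 0
    gh, gw = grid_h // group, grid_w // group
    # Row and column index of each group on the coarse gh x gw grid.
    rows, cols = np.divmod(np.arange(gh * gw), gw)
    dr = np.abs(rows[:, None] - rows[None, :])
    dc = np.abs(cols[:, None] - cols[None, :])
    if mode == "block":
        # Assumed reading of "surrounding blocks": groups within `radius`.
        return (dr <= radius) & (dc <= radius)
    if mode == "cross":
        # Assumed reading of "criss-cross": same group row or same group column.
        return (dr == 0) | (dc == 0)
    raise ValueError(f"unknown mode: {mode}")

def density(mask):
    """Fraction of (query group, key group) pairs that are actually computed."""
    return float(mask.mean())

block = group_mask(8, 8, 2, mode="block", radius=1)
cross = group_mask(8, 8, 2, mode="cross")
print(block.shape, density(block), density(cross))
```

Because the mask is defined per group rather than per token, whole tiles of the attention matrix are either computed or skipped, which is the property that lets a sparse pattern map onto block-wise GPU kernels. The density value only counts skipped tiles; it is not a measured speedup.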
