Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

Hongjie Wang, Difan Liu, Yan Kang, Yijun Li, Zhe Lin, Niraj K. Jha, Yuchen Liu

TL;DR

This work tackles the high computational cost of diffusion models by introducing AT-EDM, a training-free framework that prunes tokens in attention blocks at run-time using attention maps. The approach combines a fast token-pruning algorithm, G-WPR, with a similarity-based token recovery to preserve convolution compatibility, and a DSAP schedule to adapt pruning across denoising steps. Empirical results on SD-XL show substantial FLOPs reductions (up to 38.8%) and speed-ups (up to 1.53×) while maintaining near-full-model fidelity and text-image alignment (FID/CLIP). AT-EDM is complementary to existing efficiency methods and demonstrates strong performance gains without backbone retraining, offering practical impact for deploying diffusion models on constrained hardware.
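
The idea of a step-dependent pruning budget can be sketched in a few lines. This is only an illustration of the qualitative behavior stated in the Figure 3 caption (prune fewer tokens in early denoising steps, more in later ones). The function name, the two-phase split, and every numeric value below are hypothetical and are not taken from the paper.

```python
def pruning_ratio_per_step(num_steps, num_early_steps, early_ratio, late_ratio):
    """Return one token-pruning ratio per denoising step.

    Hypothetical two-phase schedule: the first `num_early_steps` steps use
    the gentler `early_ratio`, all remaining steps use `late_ratio`.
    """
    if not 0 <= num_early_steps <= num_steps:
        raise ValueError("num_early_steps must lie in [0, num_steps]")
    if not 0.0 <= early_ratio <= late_ratio < 1.0:
        raise ValueError("expected 0 <= early_ratio <= late_ratio < 1")
    return [
        early_ratio if step < num_early_steps else late_ratio
        for step in range(num_steps)
    ]


# Illustrative values only: 10 steps, the first 3 pruned lightly.
schedule = pruning_ratio_per_step(10, 3, early_ratio=0.1, late_ratio=0.4)
mean_ratio = sum(schedule) / len(schedule)
```

A real schedule would presumably be tuned against generation quality. The sketch only shows that the budget is a function of the denoising step rather than a single constant.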

Abstract

Diffusion Models (DMs) have exhibited superior performance in generating high-quality and diverse images. However, this exceptional performance comes at the cost of expensive architectural design, particularly due to the attention module heavily used in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end, we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens, without the need for any retraining. Specifically, for single-denoising-step pruning, we develop a novel ranking algorithm, Generalized Weighted Page Rank (G-WPR), to identify redundant tokens, and a similarity-based recovery method to restore tokens for the convolution operation. In addition, we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g., 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. Project webpage: https://atedm.github.io.
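
As a rough illustration of ranking tokens from an attention map, the sketch below uses a PageRank-style vote: each token distributes its current score to the tokens it attends to, and the tokens that collect the least are pruned. The exact G-WPR formulation is not given in this summary, so the update rule, the uniform initialization, and the names `importance_scores` and `keep_mask` are assumptions made for illustration.

```python
def importance_scores(attention, iterations=1):
    """Score tokens by the attention they receive, weighted by voter scores.

    attention[i][j] is the attention token i pays to token j; each row is
    assumed to sum to 1 (softmax output, already averaged over heads).
    """
    n = len(attention)
    scores = [1.0 / n] * n  # assumed uniform starting scores
    for _ in range(iterations):
        # Token j collects votes from every token i, weighted by i's score.
        scores = [
            sum(attention[i][j] * scores[i] for i in range(n))
            for j in range(n)
        ]
    return scores


def keep_mask(scores, keep_count):
    """Return a boolean mask that keeps the `keep_count` highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    kept = set(ranked[:keep_count])
    return [idx in kept for idx in range(len(scores))]


# Toy 3-token attention map (rows sum to 1); token 2 receives little attention.
attention = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
scores = importance_scores(attention)
mask = keep_mask(scores, keep_count=2)  # token 2 is pruned
```

Because each row sums to 1, the scores remain a distribution after every iteration, so they can be compared directly across tokens.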

Paper Structure (35 sections, 5 equations, 21 figures, 2 tables, 1 algorithm)

Figures (21)

  • Figure 1: Examples of applying AT-EDM to SD-XL [sdxl]. Compared to the full-size model (top row), our accelerated model (bottom row) achieves around 40% FLOPs reduction while retaining competitive generation quality at various aspect ratios.
  • Figure 2: U-Net FLOPs breakdown of SD-XL [sdxl], measured for 1024px image generation. Among the components of U-Net (convolution blocks, ResNet blocks, and attention blocks), attention blocks cost the most.
  • Figure 3: Overview of our proposed efficiency enhancement framework AT-EDM. Single-Denoising-Step Token Pruning: (1) We get the attention map from self-attention. (2) We calculate the importance score for each token using G-WPR. (3) We generate pruning masks. (4) We apply the masks to tokens after the feed-forward network to realize token pruning. (5) We repeat Steps (1)-(4) for each consecutive attention layer. (6) Before passing feature maps to the ResNet block, we recover pruned tokens through similarity-based copy. Denoising-Steps-Aware Pruning Schedule: In early steps, we prune fewer tokens and accept a smaller FLOPs reduction. In later steps, we prune more aggressively for higher speed-up.
  • Figure 4: Our similarity-based copy method for token recovery resolves the incompatibility between token pruning and ResNet. Token pruning leaves feature maps with a non-square shape, which is not compatible with ResNet. To address this issue, we propose similarity-based copy to recover the pruned tokens. It first averages the attention map across heads and deletes the rows of pruned tokens so that they cannot be selected as the most similar token. Then, for each pruned token, it finds the source of the highest attention that token receives and copies the corresponding retained token for recovery. After recovery, the tokens can be reshaped into a spatially complete feature map that serves as input to the ResNet blocks.
  • Figure 5: Variance of attention maps in different denoising steps. We divide the denoising steps into four typical regions: (I) Very-early steps: Variance of attention maps is small and increases rapidly. (II) Mid-early steps: Variance of attention maps is large and increases slowly. (III) Middle steps: Variance of attention maps is large and almost constant. (IV) Last several steps.
  • ...and 16 more figures
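
The similarity-based copy described in the Figure 4 caption can be sketched as follows. This is a minimal pure-Python reading of the caption, not the authors' implementation: the function name `recover_pruned_tokens`, the nested-list data layout, and the convention that `attention[i][j]` is the attention token i pays to token j are all assumptions.

```python
def recover_pruned_tokens(attention_heads, retained_features, keep_mask):
    """Fill each pruned position with a copy of its most similar retained token.

    attention_heads[h][i][j]: attention token i pays to token j in head h.
    retained_features: feature vectors of the kept tokens, in token order.
    keep_mask[j]: True if token j was kept, False if it was pruned.
    """
    num_tokens = len(keep_mask)
    num_heads = len(attention_heads)
    # Step 1: average the attention map across heads.
    mean_attention = [
        [sum(head[i][j] for head in attention_heads) / num_heads
         for j in range(num_tokens)]
        for i in range(num_tokens)
    ]
    # Step 2: only rows of retained tokens are candidates (pruned rows dropped).
    retained = [i for i, kept in enumerate(keep_mask) if kept]
    feature_of = dict(zip(retained, retained_features))
    recovered = []
    for j in range(num_tokens):
        if keep_mask[j]:
            recovered.append(feature_of[j])
        else:
            # Step 3: retained token that sends the most attention to token j.
            source = max(retained, key=lambda i: mean_attention[i][j])
            recovered.append(list(feature_of[source]))
    return recovered


# Toy example: 4 tokens, 2 heads, tokens 1 and 3 were pruned.
head_a = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
          [0.1, 0.1, 0.4, 0.4], [0.1, 0.1, 0.1, 0.7]]
head_b = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
          [0.1, 0.1, 0.2, 0.6], [0.1, 0.1, 0.1, 0.7]]
mask = [True, False, True, False]
kept_features = [[1.0, 1.0], [3.0, 3.0]]  # features of tokens 0 and 2
recovered = recover_pruned_tokens([head_a, head_b], kept_features, mask)
```

In the toy example, pruned token 1 is filled from token 0 and pruned token 3 from token 2, so the output again has one feature vector per spatial position and can be reshaped into a complete feature map. Dropping the pruned rows matters here: rows 1 and 3 carry the largest values in their own columns and would otherwise select themselves.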