Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Yifan Pu; Zhuofan Xia; Jiayi Guo; Dongchen Han; Qixiu Li; Duo Li; Yuhui Yuan; Ji Li; Yizeng Han; Shiji Song; Gao Huang; Xiu Li

TL;DR

This paper addresses the high computational cost that global self-attention imposes on Diffusion Transformers. It identifies redundancy in query-key interactions that is especially pronounced in the early denoising steps. The authors introduce attention mediators, a compact extra set of tokens that interact separately with queries and keys, which compresses the interaction and reduces attention complexity to linear; a time-step dependent schedule then adjusts the number of mediators during denoising. The mediator-based DiT achieves lower FLOPs and improved image quality across resolutions (e.g., a state-of-the-art FID of $2.01$ on $256\times256$ with cfg=$1.5$ when integrated with SiT), with larger speedups in high-resolution generation. The adjustable mediator count also lets inference fit different compute budgets, making high-quality diffusion generation more practical in real-world deployments.

Abstract

This paper identifies significant redundancy in the query-key interactions within self-attention mechanisms of diffusion transformer models, particularly during the early stages of denoising diffusion steps. In response to this observation, we present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. By modulating the number of mediator tokens during the denoising generation phases, our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Concurrently, integrating mediator tokens simplifies the attention module's complexity to a linear scale, enhancing the efficiency of global attention processes. Additionally, we propose a time-step dynamic mediator token adjustment mechanism that further decreases the required computational FLOPs for generation, simultaneously facilitating the generation of high-quality images within the constraints of varied inference budgets. Extensive experiments demonstrate that the proposed method can improve the generated image quality while also reducing the inference cost of diffusion transformers. When integrated with the recent work SiT, our method achieves a state-of-the-art FID score of 2.01. The source code is available at https://github.com/LeapLabTHU/Attention-Mediators.
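
The two-stage attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the choice to build mediators by average-pooling queries, and the schedule values are all assumptions made for illustration. The point it shows is that with $n$ mediators and $N$ tokens, both score matrices have shape $n \times N$ or $N \times n$, so the cost grows linearly in $N$ for fixed $n$ rather than quadratically.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mediator_attention(q, k, v, num_mediators):
    """Single-head attention routed through mediator tokens.

    q, k, v: arrays of shape (N, d). Returns an array of shape (N, d).
    Assumption: mediators are average-pooled queries over contiguous
    token groups; the paper may construct them differently.
    """
    n_tokens, d = q.shape
    assert 1 <= num_mediators <= n_tokens
    groups = np.array_split(np.arange(n_tokens), num_mediators)
    mediators = np.stack([q[g].mean(axis=0) for g in groups])  # (n, d)
    scale = 1.0 / np.sqrt(d)
    # Stage 1: mediators attend to keys and aggregate values, (n, N) @ (N, d).
    mediator_values = softmax(mediators @ k.T * scale) @ v
    # Stage 2: queries attend to mediators and read the aggregates, (N, n) @ (n, d).
    return softmax(q @ mediators.T * scale) @ mediator_values


def mediators_for_progress(progress, schedule=((0.5, 4), (0.8, 16), (1.0, 64))):
    """Hypothetical step-wise schedule: fewer mediators early, more later.

    progress is the fraction of denoising completed, in [0, 1]. The
    thresholds and counts are placeholders, not values from the paper.
    """
    for upper_bound, count in schedule:
        if progress <= upper_bound:
            return count
    return schedule[-1][1]


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))
out = mediator_attention(q, k, v, num_mediators=mediators_for_progress(0.1))
```

Because each stage is a softmax-weighted average, every output row is a convex combination of the rows of `v`, just as in standard attention; only the path the weights take is compressed.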

Paper Structure (23 sections, 11 equations, 5 figures, 3 tables)

Introduction
Related Works
Diffusion Transformers
Attention with Linear Complexity
Dynamic Neural Networks
Attention Redundancies Along Denoising Steps
Background of Attention
Jensen-Shannon Divergence as A Redundancy Metric
Redundancies Along Time Steps
Efficient DiTs with Attention Mediators
Attention Mediators
Complexity Analysis
Time Step-wise Mediator Adjusting
Experiments
Experimental Setups
...and 8 more sections

Figures (5)

Figure 1: (a) The JSD-based redundancy score (defined in the section "Jensen-Shannon Divergence as A Redundancy Metric") evaluated on the DiT-S/2 model across diffusion time steps. The score is computed over 32 samples and averaged over the attention heads in every layer. (b) The same redundancy score for all 12 layers of the SiT-S/2 model with the SDE sampler.
Figure 2: Ablation for optimized mediator token adjustment schedule. (a) Trade-off between FID-50K and FLOPs. (b) Trade-off between sFID-50K and FLOPs.
Figure 3: Main results of the proposed method at $256\times256$ resolution. Each string of red dots is obtained by adjusting the number of mediator tokens with optimized thresholds. (a) Comparison with DiT (Peebles and Xie, 2023) and SiT (Ma et al., 2024); (b) zoomed-in results around SiT-B/2.
Figure 4: High resolution image generation results.
Figure 5: Images sampled from SiT-XL/2 models equipped with our method, trained on ImageNet at $256\times256$ resolution, with cfg=$4.0$.
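
The redundancy score in Figure 1 is based on the Jensen-Shannon divergence (JSD). The sketch below shows only the standard JSD between two discrete distributions, such as two rows of an attention map. How the paper aggregates these values into its redundancy score is not stated in this summary, so the function name and the example inputs are assumptions for illustration.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    Inputs are non-negative weight vectors of equal length; they are
    normalised to sum to 1 before the divergence is computed.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Two hypothetical attention rows over four keys.
row_a = [0.7, 0.1, 0.1, 0.1]
row_b = [0.6, 0.2, 0.1, 0.1]
similarity_gap = js_divergence(row_a, row_b)
```

The JSD is symmetric and bounded between 0 (identical distributions) and $\ln 2$ (disjoint support), so a small value between the attention rows of different queries indicates that those queries attend to the keys in nearly the same way.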
