DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers

Hanling Zhang, Rundong Su, Zhihang Yuan, Pengtao Chen, Mingzhu Shen, Yibo Fan, Shengen Yan, Guohao Dai, Yu Wang

TL;DR

This work tackles the computational bottleneck of attention in Multimodal Diffusion Transformers (MMDiT) by introducing DiTFastAttnV2, a post-training compression framework built on head-wise arrow attention, head-wise caching, and a dedicated fused kernel. A lightweight calibration metric and progressive, per-head plan optimization cut the compression plan search from hours to minutes. The method delivers up to a $68\%$ reduction in attention FLOPs and up to a $1.5\times$ end-to-end speedup on 2K image generation while preserving perceptual quality. Experiments on Stable Diffusion 3 and FLUX show strong sparsity-quality trade-offs, with generation remaining robust as computation is reduced. The approach offers a practical path to scalable, efficient multi-modal diffusion systems by tailoring attention compression to per-head dynamics and pairing it with an efficient kernel implementation.

Abstract

Text-to-image generation models, especially Multimodal Diffusion Transformers (MMDiT), have shown remarkable progress in generating high-quality images. However, these models often face significant computational bottlenecks, particularly in attention mechanisms, which hinder their scalability and efficiency. In this paper, we introduce DiTFastAttnV2, a post-training compression method designed to accelerate attention in MMDiT. Through an in-depth analysis of MMDiT's attention patterns, we identify key differences from prior DiT-based methods and propose head-wise arrow attention and caching mechanisms to dynamically adjust attention heads, effectively bridging this gap. We also design an Efficient Fused Kernel for further acceleration. By leveraging local metric methods and optimization techniques, our approach significantly reduces the search time for optimal compression schemes to just minutes while maintaining generation quality. Furthermore, with the customized kernel, DiTFastAttnV2 achieves a 68% reduction in attention FLOPs and 1.5x end-to-end speedup on 2K image generation without compromising visual fidelity.
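To make the idea of an arrow-shaped attention pattern concrete, here is a minimal sketch of what such a mask might look like over a joint text-plus-visual sequence. It is an illustration under stated assumptions, not the paper's kernel: the function names `arrow_mask` and `kept_fraction`, the `[text | visual]` token order, and the 1D band over flattened visual tokens are all hypothetical choices made for this example. The sketch keeps every interaction that involves a text token and restricts visual-visual interactions to a local window, then reports the fraction of attention scores that would still be computed.

```python
def arrow_mask(n_text, n_visual, window):
    """Boolean mask over a joint [text | visual] sequence (assumed layout).

    True means the query/key score is computed. Text rows and text columns
    stay dense; visual-visual entries are kept only within `window` of the
    diagonal (a 1D simplification of a local window over image tokens).
    """
    n = n_text + n_visual
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_text or j < n_text:
                mask[i][j] = True                   # any text interaction stays dense
            else:
                mask[i][j] = abs(i - j) <= window   # local visual-visual band
    return mask


def kept_fraction(mask):
    """Fraction of attention scores still computed under the mask."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    m = arrow_mask(n_text=2, n_visual=4, window=1)
    for row in m:
        print("".join("#" if x else "." for x in row))
    print(f"kept fraction: {kept_fraction(m):.3f}")
```

In this toy setting the kept fraction is only a rough proxy for attention FLOPs; actual savings depend on block granularity and on how many heads use the sparse pattern.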

Paper Structure

This paper contains 23 sections, 2 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Visual comparison of image generation results using DiTFastAttnV2 under different compression thresholds $\delta$. The percentage under each image is the attention reduction ratio for that setting. Our method achieves up to 69% reduction in attention computation while preserving high-quality generation outputs. The visual fidelity remains consistent even at higher compression rates.
  • Figure 2: DiT and MMDiT block architecture. In MMDiT, after projections, visual and text tokens are concatenated for a joint self-attention.
  • Figure 3: Overview of DiTFastAttnV2
  • Figure 4: Left: Selected head attention map examples. Head 10 exhibits global attention in the visual-visual token interaction area, while heads 14 and 20 exhibit different extents of local attention patterns. In text interaction areas, attention varies with different prompts. Right: Different heads within a layer exhibit different levels of redundancy across denoising steps. The attention similarity between adjacent time steps in Head 21 is significantly higher than that in Head 2.
  • Figure 5: Visualization of arrow attention maps in our efficient kernel implementation. The kernel operates in a block-sparse manner, transforming mix blocks into dense blocks to reduce memory access overhead.
  • ...and 3 more figures
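
The observation in Figure 4 (right), that some heads change little between adjacent denoising steps while others change a lot, suggests deciding per head whether to recompute attention or reuse the previous step's output. The sketch below illustrates that dispatch logic only. The function `run_step`, the `"full"`/`"reuse"` plan labels, and the dictionary cache are hypothetical names for this example and do not reflect the authors' implementation or fused kernel.

```python
def run_step(compute_head, plan, cache):
    """Run one denoising step with a per-head plan (illustrative only).

    compute_head: callable taking a head index and returning that head's output.
    plan: list where plan[h] is "full" (recompute) or "reuse" (use cached output).
    cache: dict mapping head index -> most recently computed output; updated in place.
    """
    outputs = []
    for h, mode in enumerate(plan):
        if mode == "reuse" and h in cache:
            outputs.append(cache[h])      # skip attention for this head
        else:
            cache[h] = compute_head(h)    # compute and remember for later steps
            outputs.append(cache[h])
    return outputs


if __name__ == "__main__":
    calls = []

    def fake_head(h):
        # Stand-in for a real attention computation; records which heads ran.
        calls.append(h)
        return f"head{h}@call{len(calls)}"

    cache = {}
    plan = ["full", "reuse"]              # head 1 is assumed to be redundant across steps
    print(run_step(fake_head, plan, cache))
    print(run_step(fake_head, plan, cache))
    print("heads computed:", calls)
```

On the first step nothing is cached, so every head is computed regardless of the plan; from the second step onward only the heads marked `"full"` are recomputed.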