ToMA: Token Merge with Attention for Diffusion Models

Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang

TL;DR

ToMA tackles the diffusion-model latency arising from quadratic self-attention by redesigning token merging for GPU efficiency. It formulates destination-token selection as a submodular facility-location problem and expresses merge/unmerge as an attention-like linear transformation, enabling efficient matrix-based implementations. By exploiting latent locality and sequential redundancy, and reusing merge patterns across steps, ToMA delivers substantial practical speedups (roughly 24–28%) with minimal degradation in image quality across SDXL-base and Flux. The method is training-free, architecture-agnostic, and complements existing attention optimizations, bridging the gap between theoretical FLOP reductions and real-world performance. Code is available for replication and integration.
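To make the gap between theoretical FLOP reductions and measured speedups concrete, the following back-of-the-envelope sketch counts only the two matrix products in self-attention. The token count, head dimension, and the helper name `attention_matmul_flops` are illustrative assumptions, not values or code from the paper.

```python
def attention_matmul_flops(num_tokens: int, head_dim: int) -> int:
    """Approximate FLOPs for QK^T and (softmax)V in a single attention head.

    Each product is an (n x d) by (d x n) style matmul costing about
    2 * n * n * d FLOPs, so the pair costs about 4 * n^2 * d.
    """
    return 4 * num_tokens * num_tokens * head_dim


# Hypothetical sizes chosen for illustration only.
n_tokens = 4096          # assumed sequence length of one attention layer
head_dim = 64            # assumed per-head channel width
merge_ratio = 0.5        # fraction of tokens merged away (ratio=0.5)

kept_tokens = int(n_tokens * (1 - merge_ratio))
full = attention_matmul_flops(n_tokens, head_dim)
reduced = attention_matmul_flops(kept_tokens, head_dim)

# Halving the tokens quarters the quadratic term.
theoretical_reduction = full / reduced
```

Under these assumptions the attention matmuls shrink by a factor of four, yet the end-to-end speedup reported above is roughly 24–28%, because attention is only part of each block and the merge and unmerge steps have their own cost.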

Abstract

Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers' quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merge as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $\Delta < 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion. Code available at https://github.com/WenboLuu/ToMA.
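The first contribution, selecting diverse destination tokens by submodular optimization, can be illustrated with the standard greedy algorithm for the facility-location objective $f(S) = \sum_i \max_{j \in S} \mathrm{sim}(i, j)$. This is a minimal NumPy sketch, not the released implementation: the cosine similarity, the function name `greedy_facility_location`, and the dense $n \times n$ similarity matrix are assumptions made for readability.

```python
import numpy as np


def greedy_facility_location(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick k destination tokens that maximise
    sum_i max_{j in S} sim(i, j), with cosine similarity as sim (assumed)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed @ normed.T                      # (n, n) pairwise similarity
    n = sim.shape[0]

    # best[i]: similarity of token i to its closest destination chosen so far.
    # Cosine similarity is bounded below by -1, so -1 acts as "no coverage".
    best = np.full(n, -1.0)
    selected: list[int] = []

    for _ in range(k):
        # Marginal gain of candidate j: how much it improves every token's
        # coverage, summed over tokens i (rows).
        gains = np.maximum(sim - best[:, None], 0.0).sum(axis=0)
        if selected:
            gains[selected] = -np.inf            # never pick a token twice
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])

    return selected
```

Every step of the loop is a dense matrix operation (broadcasted subtraction, reduction, argmax), which is the kind of GPU-friendly formulation the abstract contrasts with sorting and scattered writes.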

Paper Structure

This paper contains 58 sections, 31 equations, 11 figures, 10 tables, 3 algorithms.

Figures (11)

  • Figure 1: Comparison on SDXL-base with four configurations (left to right): Original, +FA2, +ToMeSD, +ToMA (on top of FA2, ratio=0.5). While ToMeSD fails to speed up due to overhead, ToMA achieves significant acceleration with negligible loss in image quality.
  • Figure 2: Architectural overview of ToMA. The framework consists of three key stages: (1) Facility Location Algorithm identifies the best representative token set $D \subset N$ through submodular optimization to maximize representational diversity; (2) Attention (Merge) constructs an efficient low-rank attention matrix that maps $N\rightarrow D$ via a linear transformation for transformer computation (SelfAttn, CrossAttn, MLP) in the reduced space; (3) Inverse (Unmerge) applies the pseudo-inverse to recover full-resolution features $D\rightarrow N$. The pipeline operates through localized processing of latent space regions with parallel batch optimization for efficiency.
  • Figure 3: Re-colored k-means clusters of U-ViT hidden states across transformer blocks and denoising timesteps. A similar visualization on DiT is provided in the Appendix.
  • Figure 4: Average percentage of shared destination tokens at each denoising timestep relative to the first step of its 10-step interval. Each curve represents a different layer in the SDXL-base U-ViT model, showing high overlap and gradual divergence over time.
  • Figure 6: Qualitative comparison between Baseline SDXL-base, ToMeSD, and ToMA.
  • ...and 6 more figures
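
Stages (2) and (3) in the Figure 2 caption, merge as an attention-like linear map $N \rightarrow D$ and unmerge via a pseudo-inverse $D \rightarrow N$, can be sketched in a few lines of NumPy. This is an illustrative reading of the caption under stated assumptions, not the authors' code: the temperature `tau`, the cosine-similarity logits, the normalisation order, and the names `build_merge_matrix`, `merge`, and `unmerge` are all hypothetical.

```python
import numpy as np


def build_merge_matrix(tokens: np.ndarray, dest_idx: list[int],
                       tau: float = 0.1) -> np.ndarray:
    """Return a (k, n) matrix A whose rows are averaging weights over the
    n source tokens, one row per destination token (assumed construction)."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    logits = normed[dest_idx] @ normed.T / tau        # (k, n)
    # Softmax over destinations: each source token distributes its mass
    # across the k destinations.
    logits -= logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)
    # Row-normalise so each merged token is a weighted mean of sources.
    return weights / weights.sum(axis=1, keepdims=True)


def merge(tokens: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Map n tokens to k merged tokens with one matmul: (k, n) @ (n, c)."""
    return A @ tokens


def unmerge(merged: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Map k merged tokens back to n positions with the Moore-Penrose
    pseudo-inverse of A: (n, k) @ (k, c)."""
    return np.linalg.pinv(A) @ merged
```

Because both directions are plain matrix products, the transformer block (SelfAttn, CrossAttn, MLP) can run on the `k` merged tokens between `merge` and `unmerge`. When `A` has full row rank, merging the unmerged result reproduces the merged tokens, which is a convenient sanity check.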