ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Ethan Smith, Nayan Saxena, Aninda Saha

TL;DR

Dense attention in image diffusion models incurs high compute and memory costs, limiting practical high-resolution generation. The paper introduces ToDo, a training-free token downsampling method that combines token merging based on spatial contiguity with an attention modification that downsamples keys and values while preserving the full set of queries. This yields speedups of up to 4.5x at high resolutions with fidelity comparable to the baseline, and provides evidence of latent feature redundancy that supports sparse attention. The approach is practical on standard GPUs and may generalize to other attention-based generative models, enabling more scalable high-resolution diffusion outputs.

Abstract

Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the size of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose ToDo, a novel training-free method that relies on downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and by 4.5x or more for high resolutions such as 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.
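
To make the idea of downsampling only keys and values concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the authors' implementation: it assumes tokens lie on a square grid in row-major order, uses plain strided subsampling, and the function names (`softmax`, `downsampled_kv_attention`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def downsampled_kv_attention(q, k, v, grid_hw, stride=2):
    """Single-head attention with strided subsampling of keys and values.

    q, k, v: arrays of shape (N, d), N = H * W tokens in row-major order.
    All N queries are kept; only every `stride`-th key/value along each
    grid axis takes part in attention (an assumption for this sketch).
    """
    h, w = grid_hw
    n, d = q.shape
    assert n == h * w, "token count must match the grid size"
    # Restore the 2D grid, take every `stride`-th position, flatten again.
    k_ds = k.reshape(h, w, d)[::stride, ::stride].reshape(-1, d)
    v_ds = v.reshape(h, w, -1)[::stride, ::stride].reshape(-1, v.shape[-1])
    # The score matrix is N x (N / stride^2) instead of N x N.
    scores = q @ k_ds.T / np.sqrt(d)
    return softmax(scores) @ v_ds

rng = np.random.default_rng(0)
h = w = 8
d = 16
q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
out = downsampled_kv_attention(q, k, v, (h, w), stride=2)
print(out.shape)  # (64, 16): one output per query token
```

Because the queries are untouched, the output keeps one vector per original token, while the score matrix shrinks by a factor of `stride**2`. With `stride=1` the function reduces to ordinary dense attention.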

Paper Structure (13 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: A visualization of our method. From a given latent or image, we subsample positions on the grid in a strided fashion for the keys and values used in attention, while maintaining the full set of query tokens. Demo video: https://www.youtube.com/watch?v=e73aE7rFGrg.
  • Figure 2: Qualitative comparison of attention methods with: 25% of tokens at $1024\times 1024$, 11% at $1536 \times 1536$, and 6% at $2048\times 2048$, maintaining a consistent token count of 4096 post-merging.
  • Figure 3: Inference throughput, measured in seconds, across resolutions using attention methods at various merge ratios, with bars representing the relative performance increase against the baseline.
  • Figure 4: Lowest cosine similarity between tokens in a $3\times3$ area across diffusion timesteps and U-Net locations extracted from 10 generations of different prompts at $1024 \times 1024$. Timesteps out of 50 indicate noise reduction; Depth 0 is initial resolution, Depth 1 is after 2x downsampling. Up/down denotes encoder/decoder blocks.
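
The token percentages quoted in the Figure 2 caption can be checked with a short calculation. This sketch assumes Stable Diffusion's 8x latent downsampling and that the counts refer to attention at the full latent resolution; neither is stated in the caption, and the helper name `attention_tokens` is hypothetical.

```python
def attention_tokens(image_size, vae_factor=8):
    # Tokens on a square latent grid, assuming an 8x VAE downsampling factor.
    side = image_size // vae_factor
    return side * side

kept = 4096  # token count after merging, as stated in the Figure 2 caption
for size in (1024, 1536, 2048):
    total = attention_tokens(size)
    print(f"{size}x{size}: {total} tokens, {kept / total:.2%} kept")
```

Under these assumptions the kept fractions come out at roughly 25%, 11%, and 6%, matching the caption.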