Token Pruning for Caching Better: 9 Times Acceleration on Stable Diffusion for Free

Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, Linfeng Zhang

TL;DR

This work tackles the efficiency bottleneck of diffusion-based image generation by introducing Dynamics-Aware Token Pruning (DaTo), a training-free method that combines token pruning with feature caching while preserving temporal feature dynamics. DaTo identifies high-dynamics tokens via a temporal difference score, propagates dynamics through self-attention using base tokens, and recovers pruned tokens from their closest base tokens; an evolutionary NSGA-II search selects per-timestep caching depth and pruning ratios to balance latency and image quality. The approach yields substantial speedups (up to 9× with Stable Diffusion on ImageNet and 7× on COCO-30k) with improved or maintained FID across SDv1.5, SDv2, and SDXL, illustrating strong practical impact without requiring training. Overall, DaTo advances practical diffusion-model acceleration by jointly optimizing caching and pruning in a dynamics-aware, training-free framework that generalizes across datasets and model variants.

Abstract

Stable Diffusion has achieved remarkable success in the field of text-to-image generation, with its powerful generative capabilities and diverse generation results making a lasting impact. However, its iterative denoising introduces high computational costs and slows generation speed, limiting broader adoption. The community has made numerous efforts to reduce this computational burden, with methods like feature caching attracting attention due to their effectiveness and simplicity. Nonetheless, simply reusing features computed at previous timesteps causes the features across adjacent timesteps to become similar, reducing the dynamics of features over time and ultimately compromising the quality of generated images. In this paper, we introduce a dynamics-aware token pruning (DaTo) approach that addresses the limitations of feature caching. DaTo selectively prunes tokens with lower dynamics, allowing only high-dynamics tokens to participate in self-attention layers, thereby extending feature dynamics across timesteps. DaTo combines feature caching with token pruning in a training-free manner, achieving both temporal and token-wise information reuse. Applied to Stable Diffusion on ImageNet, our approach delivers a 9$\times$ speedup while reducing FID by 0.33, indicating enhanced image quality. On COCO-30k, we observe a 7$\times$ acceleration coupled with a notable FID reduction of 2.17.
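
The claim that reusing cached features flattens feature dynamics can be illustrated with a toy example. The sketch below is not the paper's measurement: it assumes scalar per-token features, an L1 difference as the dynamics score, and a caching scheme that reuses the entire feature map for a fixed interval (real feature caching such as DeepCache only reuses part of the network's features). The helper names `adjacent_differences` and `with_caching` are hypothetical.

```python
def adjacent_differences(features):
    """Mean absolute change between consecutive timesteps, one value per step."""
    return [
        sum(abs(b - a) for a, b in zip(prev, curr)) / len(curr)
        for prev, curr in zip(features, features[1:])
    ]


def with_caching(features, interval):
    """Reuse the last fully computed step for the following `interval - 1` steps.

    Assumption: the whole feature map is cached, which exaggerates the effect.
    """
    return [features[t - t % interval] for t in range(len(features))]


# Toy data: 6 timesteps, 3 tokens whose scalar features drift linearly over time.
full = [[float(t * (i + 1)) for i in range(3)] for t in range(6)]
cached = with_caching(full, interval=3)

full_diffs = adjacent_differences(full)      # steady change at every step
cached_diffs = adjacent_differences(cached)  # zero change on every reused step

print(full_diffs)
print(cached_diffs)
```

In this toy setting the uncached features change by the same amount at every step, while the cached sequence shows no change on reused steps and a single large jump when the cache is refreshed. This is the kind of reduced dynamics that the abstract says DaTo is designed to counter.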

Paper Structure (38 sections, 11 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 38 sections, 11 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Feature difference between adjacent timesteps under three settings: original Stable Diffusion, feature caching (DeepCache), and DaTo. DaTo produces a distribution that is more similar to that of the original Stable Diffusion, suggesting that our proposed token pruning method helps restore feature dynamics across timesteps.
  • Figure 2: Visual comparison across multiple models and methods, including SD and SDXL. The methods include caching only (DeepCache), token pruning only (ToMeSD), and a naive combination of both (DeepCache & ToMeSD). Our proposed model maintains high fidelity to the original image content while preserving intricate details and aligning closely with textual prompts, achieving superior image quality in both text-to-image and image-to-image generation tasks.
  • Figure 3: Overall pipeline of DaTo. (I) Search Optimal Caching and Pruning Strategy. We use evolutionary search to identify the optimal caching depth $d$ and pruning ratio $r$ for each timestep by minimizing both latency and FID score. (II) Feature Caching with Optimal Strategy. Feature caching employs dynamic token pruning based on the optimal strategy for efficiency. (III) Feature Pruning with Optimal Strategy. (a) Base token selection based on adjacent timestep differences: Divide the image into $s\times s$ patches and select the base token with the largest noise difference between adjacent timesteps in each patch. (b) Align base tokens with and without CFG guidance: Make the positions of the base tokens without CFG guidance match the positions of the base tokens with CFG guidance. (c) Pruned token selection: The $r$ tokens that exhibit the highest cosine similarity to the base tokens are chosen as the pruned tokens. (d) Pruned token recovery: The pruned tokens are restored by copying from the base tokens that are most similar to them.
  • Figure 4: Visualization of feature heatmaps of the difference between adjacent timesteps: original SD (without acceleration), cache-only, and our method. (a): The original SD. (b): DeepCache reduces feature dynamics across timesteps, resulting in a loss of semantic information. (c): Our method effectively restores rich semantic information in features, preserving generation quality while boosting efficiency.
  • Figure 5: Visual comparison on SD1.5 across methods, including caching only (DeepCache), token pruning only (ToMeSD), and a naive combination of both (DeepCache & ToMeSD).
  • ...and 1 more figures
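
Steps (a), (c), and (d) of the Figure 3 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tokens are short feature lists on a row-major grid, the per-token dynamics score is an L1 difference between adjacent timesteps, the number of pruned tokens is passed directly as a count `r`, the CFG alignment of step (b) is omitted, and self-attention is replaced by an identity stand-in. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_base_tokens(curr, prev, height, width, s):
    """(a) In each s x s patch, keep the token that changed most since the last step."""
    base = []
    for top in range(0, height, s):
        for left in range(0, width, s):
            patch = [row * width + col
                     for row in range(top, min(top + s, height))
                     for col in range(left, min(left + s, width))]
            base.append(max(
                patch,
                key=lambda i: sum(abs(a - b) for a, b in zip(curr[i], prev[i])),
            ))
    return base


def select_pruned_tokens(curr, base, r):
    """(c) Prune the r non-base tokens most similar to a base token.

    Returns a mapping {pruned token index: its most similar base token index}.
    """
    base_set = set(base)
    scored = []
    for i in range(len(curr)):
        if i in base_set:
            continue
        nearest = max(base, key=lambda b: cosine(curr[i], curr[b]))
        scored.append((cosine(curr[i], curr[nearest]), i, nearest))
    scored.sort(key=lambda item: item[0], reverse=True)
    return {i: nearest for _, i, nearest in scored[:r]}


def recover(outputs, pruned):
    """(d) Fill each pruned position by copying the output of its matched base token."""
    restored = dict(outputs)
    for i, b in pruned.items():
        restored[i] = outputs[b]
    return restored


# Toy 2 x 4 token grid (row-major indices 0..7) with 2-dimensional features.
prev = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
        [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
curr = [[1.0, 0.1], [2.0, 0.0], [0.0, 1.0], [0.0, 2.0],
        [1.0, 0.0], [1.0, 0.5], [0.0, 3.0], [0.5, 1.0]]

base = select_base_tokens(curr, prev, height=2, width=4, s=2)
pruned = select_pruned_tokens(curr, base, r=3)

# Stand-in for self-attention: only the kept tokens are processed (identity here).
outputs = {i: curr[i] for i in range(len(curr)) if i not in pruned}
restored = recover(outputs, pruned)

print(base, pruned)
```

Only the kept tokens would be passed to self-attention, which is where the saving comes from; the pruned positions are filled in afterwards so that the output keeps the full token grid.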