CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models

Xinle Cheng, Zhuoming Chen, Zhihao Jia

TL;DR

The paper tackles the computational bottleneck of diffusion-based text-to-image generation by introducing CAT Pruning, a cluster-aware token pruning method that uses relative noise magnitude, token staleness, and spatial clustering to selectively update tokens during denoising. By caching noise-space outputs and updating only a subset of tokens per iteration, the approach achieves substantial MACs reductions (about $50\%$ at 28 steps and $60\%$ at 50 steps) while preserving image quality, with end-to-end speedups near $1.9\times$. The technique is validated on Stable Diffusion v3 and Pixart-Sigma across PartiPrompts and COCO2017, and it remains compatible with other accelerations like DeepCache to yield further gains. Overall, CAT Pruning provides a practical, scalable way to accelerate diffusion models without compromising perceptual or CLIP-based metrics.

Abstract

Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis; however, their iterative denoising process demands substantial computational resources. In this paper, we present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge. By employing relative noise magnitude, we identify significant token changes across denoising iterations. Additionally, we enhance token selection by incorporating spatial clustering and ensuring distributional balance. Our experiments demonstrate a 50%-60% reduction in computational costs while preserving the performance of the model, thereby markedly increasing the efficiency of diffusion models. The code is available at https://github.com/ada-cheng/CAT-Pruning
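
To make the selection idea concrete, here is a minimal sketch of how relative noise magnitude, token staleness, and spatial clusters might be combined to choose which tokens to recompute at a step. It is an illustration under stated assumptions, not the authors' algorithm: the function names, the additive score with weight `alpha`, and the proportional per-cluster quota are all hypothetical choices made for this example. The inputs are assumed to be precomputed per-token values.

```python
from collections import defaultdict


def select_tokens(noise_mag, staleness, cluster_ids, budget, alpha=1.0):
    """Pick `budget` token indices to recompute at this denoising step.

    noise_mag[i]   : relative noise magnitude of token i (larger = changing more)
    staleness[i]   : number of steps since token i was last recomputed
    cluster_ids[i] : spatial cluster label of token i
    alpha          : hypothetical weight for staleness (an assumption of this sketch)
    """
    n = len(noise_mag)
    # Assumed scoring rule: magnitude plus weighted staleness.
    score = [noise_mag[i] + alpha * staleness[i] for i in range(n)]

    # Group token indices by cluster, highest score first.
    clusters = defaultdict(list)
    for i in range(n):
        clusters[cluster_ids[i]].append(i)
    for members in clusters.values():
        members.sort(key=lambda i: score[i], reverse=True)

    # Give each cluster a share of the budget proportional to its size,
    # so the selection does not collapse onto a single image region.
    selected = []
    for members in clusters.values():
        quota = budget * len(members) // n
        selected.extend(members[:quota])

    # Rounding down may leave budget unused; fill it with the best remaining tokens.
    chosen = set(selected)
    rest = sorted((i for i in range(n) if i not in chosen),
                  key=lambda i: score[i], reverse=True)
    selected.extend(rest[: budget - len(selected)])
    return sorted(selected)


def update_staleness(staleness, selected):
    """Reset staleness for recomputed tokens and age all the others by one step."""
    chosen = set(selected)
    return [0 if i in chosen else s + 1 for i, s in enumerate(staleness)]


# Usage: eight tokens in two spatial clusters, recompute half of them.
mags = [5, 4, 3, 2, 0.1, 0.2, 0.3, 0.4]
picked = select_tokens(mags, [0] * 8, [0, 0, 0, 0, 1, 1, 1, 1], budget=4)
print(picked)  # -> [0, 1, 6, 7]
```

Selecting by magnitude alone would pick tokens 0 to 3 here, all from one cluster. The per-cluster quota spreads the updates across both regions, and tokens that are skipped accumulate staleness, which raises their score in later steps. Outputs for unselected tokens would be served from the cache.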

Paper Structure

This paper contains 17 sections, 5 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: CAT Pruning in Stable Diffusion v3. The top row depicts the standard denoising process of Stable Diffusion v3 over 28 inference steps, representing the baseline configuration. The bottom row demonstrates the generative performance of CAT Pruning, which achieves similar generative quality while reducing computation cost by 2$\times$ and end-to-end inference time by 1.90$\times$.
  • Figure 2: Method Overview. At each iteration, tokens are dynamically selected using a combination of the clustering results, noise magnitude, and token staleness. Each part is elaborated in Sec. \ref{sec::noise mag}, Sec. \ref{sec::balance}, and Sec. \ref{sec::cluster}. It is worth noting that we perform clustering only once at step $t_0+1$ to avoid computational overhead.
  • Figure 3: Scatter plot showing the norm of the relative noise at the current step versus the norm of the relative noise at the previous step. We calculate and visualize the Pearson correlation coefficient between these two values.
  • Figure 4: Visualization of Results Based on Noise Magnitude alone. Selecting tokens purely by noise magnitude causes the indices to center around the teddy bear’s body (as shown in the first row), resulting in noticeable noise artifacts (second row) in the background and a lack of smoothness in the predicted noise.
  • Figure 5: Visualization of Results Based on Noise Magnitude and Token Staleness. Incorporating both staleness and noise magnitude in token selection yields a more balanced selection distribution, resulting in improved outputs with notably smoother backgrounds and smoother predicted noise.
  • ...and 4 more figures
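
The quantity plotted in Figure 3 can be sketched as follows: compute a per-token norm of the relative noise at two consecutive steps, then take the Pearson correlation coefficient between the two lists of norms. This is a minimal illustration. The helper names and the toy inputs are assumptions made for the example, and the exact definition of "relative noise" is taken as given rather than reproduced from the paper.

```python
import math


def l2_norm(vec):
    """Euclidean norm of one token's relative-noise vector."""
    return math.sqrt(sum(x * x for x in vec))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Toy data (hypothetical): relative-noise vectors for three tokens at two steps.
prev_step = [[3.0, 4.0], [0.0, 1.0], [6.0, 8.0]]
curr_step = [[1.5, 2.0], [0.0, 0.5], [3.0, 4.0]]

prev_norms = [l2_norm(v) for v in prev_step]
curr_norms = [l2_norm(v) for v in curr_step]
r = pearson(prev_norms, curr_norms)
```

A coefficient close to 1 would indicate that tokens with large relative noise at one step tend to have large relative noise at the next, which is what makes the previous step's magnitude usable as a selection signal.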