Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model

Omid Saghatchian; Atiyeh Gh. Moghadam; Ahmad Nickabadi

TL;DR

This work tackles the high computational burden of diffusion models by introducing Cached Adaptive Token Merging (CA-ToMe), which combines an adaptive similarity-threshold merging strategy with a caching mechanism to reduce self-attention cost in the top-layer U-Net transformers. By replacing fixed merging rates with a threshold $t$ and reusing token-pair decisions across steps, CA-ToMe achieves a notable speedup ($\approx1.24\times$) in denoising while maintaining image quality close to or better than prior token-merging approaches. The approach focuses on the high-cost $D_1$/$U_1$ blocks and leverages the smoothness of token changes across timesteps, validated by experiments on ImageNet with Stable Diffusion v1.5. The key contributions are (i) adaptive merging based on similarity distributions within a $2\times2$ stride, (ii) a token-pair caching scheme using checkpointed computations, and (iii) empirical demonstration of training-free acceleration with minimal fidelity loss, suggesting practical applicability to diffusion-based generation pipelines.

Abstract

Diffusion models have emerged as a promising approach for generating high-quality, high-dimensional images. Nevertheless, these models are hindered by their high computational cost and slow inference, partly due to the quadratic computational complexity of the self-attention mechanism with respect to input size. Various approaches have been proposed to address this drawback. One such approach, known as token merging (ToMe), reduces the number of tokens fed into self-attention: it calculates the similarity between tokens and then merges a fixed proportion $r$ of the most similar ones. However, because the frequency of similarity values varies and the merged pairs show repetitive patterns across adjacent steps, our method, cached adaptive token merging (CA-ToMe), enhances this approach with an adaptive threshold for merging tokens and a caching mechanism that stores similar pairs across several adjacent steps. Empirical results demonstrate that our method operates as a training-free acceleration method, achieving a speedup factor of 1.24 in the denoising process while maintaining the same FID scores compared to existing approaches.
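
The difference between a fixed merging rate and a similarity threshold can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`best_matches`, `select_fixed_rate`, `select_threshold`) are hypothetical, tokens are plain Python lists, and $r$ is treated as a fraction of the source tokens for simplicity.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_matches(src, dst):
    """For each source token, find its most similar destination token.

    Returns a list of (src_index, dst_index, similarity) tuples.
    """
    matches = []
    for i, s in enumerate(src):
        sims = [cosine(s, d) for d in dst]
        j = max(range(len(dst)), key=sims.__getitem__)
        matches.append((i, j, sims[j]))
    return matches

def select_fixed_rate(matches, r):
    """Fixed-rate selection: always merge the top r fraction of pairs,
    however similar or dissimilar they are (assumed simplification)."""
    k = int(r * len(matches))
    ranked = sorted(matches, key=lambda m: m[2], reverse=True)
    return ranked[:k]

def select_threshold(matches, t):
    """Threshold selection: merge every pair whose similarity is at least t,
    so the number of merged pairs adapts to the similarity distribution."""
    return [m for m in matches if m[2] >= t]

# Toy data (illustrative only): three source tokens, two destination tokens.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dst = [[1.0, 0.1], [-1.0, 0.0]]
matches = best_matches(src, dst)
fixed = select_fixed_rate(matches, r=0.5)
adaptive = select_threshold(matches, t=0.7)
```

On this toy input, the fixed rate merges one pair regardless of how similar the remaining candidates are, while the threshold merges both pairs whose similarity exceeds 0.7 and leaves the dissimilar one alone.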

Paper Structure

This paper contains 8 sections, 1 equation, 4 figures, and 4 tables.

Introduction
Background
Methodology
Improved Token Merging through Similarity Distribution Analysis
Reducing Redundant Computations through Token Pair Caching
Implementation Details
Experiments
Conclusion

Figures (4)

Figure 1: A comparison of images generated by three different models. (First row) Images generated using SDv1.5 [rombach2022high] without any token reduction. (Second row) Token merging [bolya2022token] with $r = 50\%$. (Third row) Our method (CA-ToMe) with a threshold of $t = 0.7$. The comparison illustrates that our method better preserves background details.
Figure 2: Frequency of similarity values in a self-attention block under two scenarios. Each plot is a histogram of the similarities between input tokens. The blue bars mark similarities that fall outside the top $r$ fraction of most similar tokens and are therefore not merged. (A) Most tokens are dissimilar, so a constant merging rate lets the method merge dissimilar tokens, which can lead to information loss. With a threshold, as shown in the figure, fewer tokens are merged and only similar ones are selected. (B) Most tokens are similar (a typical scenario in the first steps of denoising). A constant merging rate forces the method to merge fewer tokens than it could without any damage to quality. In this scenario, a threshold merges more tokens and speeds up inference.
Figure 3: The overall scheme of pair caching. The U-shaped blocks at the top show the denoising U-Nets with their up-blocks and down-blocks. Each of these blocks contains transformer blocks with attention mechanisms. At some timesteps, illustrated with blue boxes, we run the whole token-merging computation; at the other timesteps, illustrated with gray boxes, we reuse the pairs from the previous timestep to perform token merging. Both the blue and gray boxes contain a bipartite graph in which the left side represents source tokens and the right side represents destination tokens, depicted in different colors. The different token colors across the boxes indicate that the tokens do not have the same values.
Figure 4: The Jaccard distance between the pair sets of adjacent steps across different transformer layers within the $D_1$ and $U_1$ blocks. The figure is plotted from 100 images generated from 100 ImageNet classes. The shaded regions around the curves represent the variance of the Jaccard distance across these images. Most intermediate steps exhibit a distance of less than 0.2, indicating high similarity between the pair sets of adjacent steps.
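
The pair caching in Figure 3 and the Jaccard distance in Figure 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `PairCache`, `jaccard_distance`, and the checkpoint schedule `[0, 5]` are hypothetical, and `compute_pairs` stands in for the full token-matching computation.

```python
def jaccard_distance(pairs_a, pairs_b):
    """Jaccard distance between two sets of (source, destination) pairs:
    1 - |A intersect B| / |A union B|. Two empty sets have distance 0."""
    a, b = set(pairs_a), set(pairs_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

class PairCache:
    """Recompute merge pairs only at checkpoint steps and reuse them otherwise.

    `checkpoints` is an assumed, user-chosen set of timesteps at which the
    full matching runs (the blue boxes in Figure 3); all other steps reuse
    the most recently computed pairs (the gray boxes).
    """

    def __init__(self, checkpoints):
        self.checkpoints = set(checkpoints)
        self.pairs = None
        self.recomputed = 0  # number of full matching computations performed

    def get(self, step, compute_pairs):
        # Recompute on the first call or whenever the step is a checkpoint.
        if self.pairs is None or step in self.checkpoints:
            self.pairs = compute_pairs(step)
            self.recomputed += 1
        return self.pairs

# Toy usage: 10 denoising steps with full matching at steps 0 and 5 only.
calls = []

def compute_pairs(step):
    # Placeholder for the real similarity matching; records when it runs.
    calls.append(step)
    return [(step, step + 1)]

cache = PairCache(checkpoints=[0, 5])
reused = [cache.get(step, compute_pairs) for step in range(10)]
distance = jaccard_distance([(0, 1), (2, 3)], [(0, 1), (4, 5)])
```

A low Jaccard distance between the pair sets of adjacent steps, as reported in Figure 4, is what makes reusing cached pairs between checkpoints a reasonable approximation.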
