Cached Adaptive Token Merging: Dynamic Token Reduction and Redundant Computation Elimination in Diffusion Model
Omid Saghatchian, Atiyeh Gh. Moghadam, Ahmad Nickabadi
TL;DR
This work tackles the high computational burden of diffusion models by introducing Cached Adaptive Token Merging (CA-ToMe), which combines an adaptive similarity-threshold merging strategy with a caching mechanism to reduce self-attention cost in the top-layer U-Net transformers. By replacing fixed merging rates with a threshold $t$ and reusing token-pair decisions across steps, CA-ToMe achieves a notable speedup ($\approx1.24\times$) in denoising while maintaining image quality close to or better than prior token-merging approaches. The approach focuses on the high-cost $D_1$/$U_1$ blocks and leverages the smoothness of token changes across timesteps, validated by experiments on ImageNet with Stable Diffusion v1.5. The key contributions are (i) adaptive merging based on similarity distributions within a $2\times2$ stride, (ii) a token-pair caching scheme using checkpointed computations, and (iii) empirical demonstration of training-free acceleration with minimal fidelity loss, suggesting practical applicability to diffusion-based generation pipelines.
Abstract
Diffusion models have emerged as a promising approach for generating high-quality, high-dimensional images. Nevertheless, these models are hindered by their high computational cost and slow inference, partly due to the quadratic computational complexity of self-attention with respect to input size. Various approaches have been proposed to address this drawback. One such approach, known as token merging (ToMe), reduces the number of tokens fed into self-attention: it calculates the similarity between tokens and then merges a fixed proportion $r$ of the most similar tokens. However, because of the repetitive patterns observed in adjacent steps and the variation in the frequency of similarities, we enhance this approach in our method, called cached adaptive token merging (CA-ToMe), by implementing an adaptive threshold for merging tokens and adding a caching mechanism that stores similar pairs across several adjacent steps. Empirical results demonstrate that our method operates as a training-free acceleration method, achieving a speedup factor of 1.24 in the denoising process while maintaining the same FID scores as existing approaches.
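The two ideas named in the abstract, threshold-based merging and reuse of token pairs across adjacent steps, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The alternating source/destination split, the averaging rule, the `PairCache` class, and the `refresh_every` parameter are all assumptions made for illustration. In particular, the paper selects destinations within a $2\times2$ stride and uses checkpointed steps, which this sketch replaces with a fixed refresh interval.

```python
import numpy as np

def select_pairs(tokens, threshold):
    """Match each source token to its most similar destination token and
    keep the pair only if their cosine similarity exceeds `threshold`.
    Assumption: even-indexed tokens are sources, odd-indexed are destinations."""
    src_idx = np.arange(0, len(tokens), 2)
    dst_idx = np.arange(1, len(tokens), 2)
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    sim = normed[src_idx] @ normed[dst_idx].T        # (n_src, n_dst) cosine similarities
    best = sim.argmax(axis=1)                        # best destination per source
    keep = sim[np.arange(len(src_idx)), best] > threshold
    return src_idx[keep], dst_idx[best[keep]]

def merge(tokens, src, dst):
    """Average each source token into its destination, then drop the source rows."""
    out = tokens.copy()
    counts = np.ones(len(tokens))
    np.add.at(out, dst, tokens[src])                 # accumulate sources into destinations
    np.add.at(counts, dst, 1)
    out /= counts[:, None]
    return np.delete(out, src, axis=0)

class PairCache:
    """Hypothetical cache: recompute pairs every `refresh_every` steps, reuse otherwise."""
    def __init__(self, refresh_every):
        self.refresh_every = refresh_every
        self.pairs = None
        self.recomputed = 0

    def get(self, step, tokens, threshold):
        if self.pairs is None or step % self.refresh_every == 0:
            self.pairs = select_pairs(tokens, threshold)
            self.recomputed += 1
        return self.pairs

# Toy data: 16 tokens forming 8 near-duplicate adjacent pairs.
rng = np.random.default_rng(0)
tokens = np.repeat(rng.normal(size=(8, 16)), 2, axis=0)
tokens += 0.01 * rng.normal(size=tokens.shape)

cache = PairCache(refresh_every=5)
for step in range(10):
    src, dst = cache.get(step, tokens, threshold=0.9)
    merged = merge(tokens, src, dst)                 # fewer tokens would enter self-attention
    tokens = tokens + 0.001 * rng.normal(size=tokens.shape)  # slow drift between steps
```

In this toy setting the number of merged tokens is decided by the similarity distribution rather than by a fixed ratio, and the similarity computation runs only on the refresh steps. A threshold above 1 merges nothing, which gives a simple sanity check.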
