Agglomerative Token Clustering
Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund
TL;DR
This work addresses the quadratic cost of self-attention in Vision Transformers (ViTs) by introducing Agglomerative Token Clustering (ATC), a parameter-free, bottom-up token merging method that operates between the self-attention and MLP blocks. ATC applies classical hierarchical clustering with cosine distance and a choice of linkage functions to merge similar tokens into a reduced token set. Across image classification, image synthesis, and object detection and segmentation, ATC achieves state-of-the-art results among token reduction methods and often matches or surpasses fully tuned baselines, with pronounced gains at low keep rates. The method generalizes across diverse datasets and tasks, offering a practical path toward more efficient ViTs with minimal loss in performance. The clustering implementation itself remains an area for acceleration and future improvement.
Abstract
We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
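The merging step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes that tokens are plain lists of floats with non-zero norm, that a keep rate r leaves ceil(r * n) clusters, and that each cluster is represented by the mean of its member tokens. The function name `agglomerative_token_merge` and the three linkage options shown are illustrative choices.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# How to turn the token-to-token distances between two clusters into one number.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda ds: sum(ds) / len(ds),
}


def agglomerative_token_merge(tokens, keep_rate, linkage="average"):
    """Merge tokens bottom-up until ceil(keep_rate * n) clusters remain.

    Returns (merged_tokens, clusters), where clusters holds the original
    token indices assigned to each merged token.
    """
    n = len(tokens)
    target = max(1, math.ceil(keep_rate * n))  # assumed rounding rule
    link = LINKAGES[linkage]

    # Pairwise cosine distances between the original tokens.
    dist = [[cosine_distance(tokens[i], tokens[j]) for j in range(n)]
            for i in range(n)]

    # Start with every token in its own cluster.
    clusters = [[i] for i in range(n)]
    while len(clusters) > target:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = link([dist[i][j] for i in clusters[a] for j in clusters[b]])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Represent each cluster by the mean of its tokens (an assumption here).
    dim = len(tokens[0])
    merged = [[sum(tokens[i][k] for i in c) / len(c) for k in range(dim)]
              for c in clusters]
    return merged, clusters


tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged, clusters = agglomerative_token_merge(tokens, keep_rate=0.5)
print(clusters)  # [[0, 1], [2, 3]]
```

This naive loop recomputes every cluster-to-cluster distance at each step, so it is only suitable for illustration. In practice, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the more usual starting point.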
