Agglomerative Token Clustering

Joakim Bruslund Haurum, Sergio Escalera, Graham W. Taylor, Thomas B. Moeslund

TL;DR

This work addresses the quadratic cost of self-attention in Vision Transformers by introducing Agglomerative Token Clustering (ATC), a parameter-free, bottom-up token merging method that operates between the self-attention and MLP blocks. ATC leverages classical hierarchical clustering with cosine distance and multiple linkage options to merge similar tokens, producing a reduced token set without learnable parameters. Across image classification, image synthesis, and object detection/segmentation, ATC achieves state-of-the-art results among token reduction methods and often matches or surpasses fully-tuned baselines, with pronounced gains at low keep rates. The method is versatile and generalizable across diverse datasets and tasks, offering a practical path toward more efficient ViTs with minimal loss in performance, while also highlighting areas for acceleration and future improvement in clustering implementations.

Abstract

We present Agglomerative Token Clustering (ATC), a novel token merging method that consistently outperforms previous token merging and pruning methods across image classification, image synthesis, and object detection & segmentation tasks. ATC merges clusters through bottom-up hierarchical clustering, without the introduction of extra learnable parameters. We find that ATC achieves state-of-the-art performance across all tasks, and can even perform on par with prior state-of-the-art when applied off-the-shelf, i.e. without fine-tuning. ATC is particularly effective when applied with low keep rates, where only a small fraction of tokens are kept and retaining task performance is especially difficult.
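The bottom-up merging described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine_distance_matrix` and `agglomerative_token_merge` are hypothetical, the number of kept tokens is assumed to be the keep rate times the token count rounded up, and each cluster is assumed to be represented by the mean of its member tokens. The linkage options shown are the standard single, complete, and average definitions from classical hierarchical clustering.

```python
import numpy as np


def cosine_distance_matrix(tokens):
    """Pairwise cosine distance (1 - cosine similarity) between token rows."""
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T


def agglomerative_token_merge(tokens, keep_rate, linkage="complete"):
    """Naive O(n^3) bottom-up clustering of tokens (illustrative only).

    tokens: (n, d) array of non-zero token embeddings.
    keep_rate: fraction of tokens to keep (rounding up is an assumption).
    linkage: "single", "complete", or "average".
    Returns the merged tokens and the list of original indices per cluster.
    """
    n = tokens.shape[0]
    n_keep = max(1, int(np.ceil(keep_rate * n)))
    dist = cosine_distance_matrix(tokens)
    clusters = [[i] for i in range(n)]  # start with one cluster per token
    link = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]

    while len(clusters) > n_keep:
        best, best_pair = np.inf, None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = link(dist[np.ix_(clusters[a], clusters[b])])
                if d < best:
                    best, best_pair = d, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]

    # Assumption: a cluster is represented by the mean of its members.
    merged = np.stack([tokens[c].mean(axis=0) for c in clusters])
    return merged, clusters
```

For example, calling `agglomerative_token_merge(np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]), keep_rate=0.5)` groups tokens 0 and 1 together and tokens 2 and 3 together, since each pair points in nearly the same direction. The triple loop is written for readability; a practical implementation would update linkage distances incrementally rather than recomputing them at every step.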

Paper Structure

This paper contains 25 sections, 5 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Illustration of the Agglomerative Clustering Method. Prior hard merging-based methods have used either partition-based approaches (e.g. DPC-KNN [DPCKNN_2022] or K-Medoids [Marin_2023]) or graph-based approaches (i.e. ToMe). All of these methods cluster the input tokens globally through the use of cluster centers. In contrast, our Agglomerative Token Clustering (ATC) method builds clusters locally, i.e. by iteratively combining the most similar tokens until the desired number of tokens remains. A step of this process is shown here, where the nodes of a graph (in this case, tokens) are connected by edges based on their similarity. The most similar pair of nodes is combined, and the edges are updated using the linkage function $D$, in this case $D^{\text{complete}}$ (Eq. \ref{eq:complete}).
  • Figure 2: Average Token Reduction Rank (Lower is better). We compare our proposed ATC method with the hard-merging based token reduction methods investigated by Haurum et al. [Haurum_2023_ICCV]. We average across four keep rates, three model capacities, and four datasets, and plot with $\pm 1$ standard deviation, following Haurum et al. We test three versions of ATC, varying the linkage function, and find that all three variants outperform the prior merging-based methods.
  • Figure 3: Percentage Point Difference per Keep Rate. We compute the average difference between our proposed ATC method and the best prior merging-based methods investigated by Haurum et al. [Haurum_2023_ICCV] for each keep rate, measured in percentage points. We average across the three model capacities and four datasets. We find that for high keep rates $r=\{70, 90\}\%$ ATC is comparable to the prior best merging method, while for $r=\{25,50\}\%$ our proposed ATC method leads to significant performance gains.
  • Figure 4: Hard Token Merging Method Comparison with the DeiT Backbone. We compare the hard-merging token reduction methods considered by Haurum et al. [Haurum_2023_ICCV] with the proposed ATC method. All methods have been fine-tuned. Model performance is measured across keep rates, $r$, given as the percentage of tokens kept at each reduction stage, and with the DeiT-{Tiny, Small, Base} models. A comparison with all 13 token reduction methods considered by Haurum et al. can be found in the supplementary materials. ImageNet and NABirds performance is measured with top-1 accuracy, whereas COCO and NUS-WIDE performance is measured with mAP. The baseline DeiT performance is marked with a dashed black line. Note that ToMe is limited to $r\geq 50\%$, and that ATC$^{\text{average}}$ and ATC$^{\text{complete}}$ often overlap.
  • Figure 5: Token Merging Visualization with $r=25\%$. We visualize the three token merging steps for DPC-KNN [DPCKNN_2022], K-Medoids [Marin_2023], and our ATC$^{\text{average}}$ on two examples from NABirds [NABirds] with a DeiT-B backbone. The first row is the input image, and each subsequent row shows the clusters constructed after the first, second, and third reduction stage. In subfigure (a) there is a major difference in the final clustering of the data: our ATC method creates separate clusters for the bird, wood pole, and background, whereas DPC-KNN and K-Medoids create mostly arbitrary clusters. Similarly, in subfigure (b) the DPC-KNN method creates largely arbitrary clusters, while K-Medoids and ATC create more meaningful clusters. However, the ATC clusters still better contain the bird in the image, while the K-Medoids clusters have background patches in all clusters. We find this to be a recurring pattern and believe it is the reason for the large improvement by ATC on NABirds at $r=25\%$.
  • ...and 16 more figures
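
The captions for Figures 4 and 5 describe the keep rate $r$ as the percentage of tokens kept at each of three reduction stages, which means the token count shrinks geometrically. The short sketch below works through that arithmetic. The helper name `tokens_per_stage` is hypothetical, and two details are assumptions rather than facts from the source: counts are rounded up at each stage, and only the 196 patch tokens of a 224x224 input with 16x16 patches are counted (any class token is ignored).

```python
import math


def tokens_per_stage(num_tokens, keep_rate, num_stages=3):
    """Token count before reduction and after each reduction stage.

    Assumes the count is rounded up at every stage; the exact rounding
    used in the paper is not stated in this summary.
    """
    counts = [num_tokens]
    for _ in range(num_stages):
        counts.append(max(1, math.ceil(keep_rate * counts[-1])))
    return counts


print(tokens_per_stage(196, 0.25))  # [196, 49, 13, 4]
print(tokens_per_stage(196, 0.50))  # [196, 98, 49, 25]
```

Under these assumptions, $r=25\%$ leaves only a handful of tokens after the third stage, which illustrates why retaining task performance at low keep rates is difficult.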