ClustViT: Clustering-based Token Merging for Semantic Segmentation

Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis

TL;DR

ClustViT introduces a semantics-guided token clustering mechanism within a ViT backbone to reduce computation for semantic segmentation. A Cluster module merges semantically similar tokens between Transformer layers, guided by pseudo-clusters derived from segmentation masks, while a Regenerator reconstructs full token representations for downstream heads. Across ADE20K, SUIM, and RumexWeeds, ClustViT achieves up to 1.64x faster inference and up to 2.18x fewer GFLOPs, with accuracy that is comparable or only modestly changed; the gains are largest in background-dominated scenes typical of robotics. The approach offers a practical pathway to deploying efficient ViT-based segmentation in real-world robotic systems by balancing semantic compression with reconstruction fidelity, and it can be paired with standard segmentation heads such as Segmenter or UPerNet.

Abstract

Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have focused on dynamically merging tokens according to the image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, where we expand upon the Vision Transformer (ViT) backbone and address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network guided by pseudo-clusters from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
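To make the quadratic-cost argument concrete, the following is a rough sketch of how the cost of one self-attention layer scales with the number of tokens. The function name `attention_flops`, the token counts (1024 and 512), and the embedding width (768, as in a ViT-Base-sized model) are illustrative assumptions and are not taken from the paper; the count ignores softmax, normalization, and the MLP block.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulate count for one self-attention layer.

    Hypothetical helper for illustration only. It counts:
      - the Q, K, V, and output projections: 4 * N * d^2 (linear in N)
      - the QK^T scores and the attention-weighted sum: 2 * N^2 * d (quadratic in N)
    """
    projections = 4 * num_tokens * dim * dim
    attention = 2 * num_tokens * num_tokens * dim
    return projections + attention


# Assumed example: 1024 tokens (e.g. a 512x512 image with 16x16 patches)
# versus the same layer after half of the tokens have been merged away.
full = attention_flops(1024, 768)
reduced = attention_flops(512, 768)

print(f"full:    {full:,} MACs")
print(f"reduced: {reduced:,} MACs")
print(f"ratio:   {full / reduced:.2f}x")  # 2.50x for these assumed sizes
```

Under these assumptions, halving the token count reduces the per-layer cost by more than half, because the attention term shrinks by a factor of four while the projection term only halves. This toy count is not a reproduction of the GFLOPs figures reported in the abstract.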

Paper Structure

This paper contains 17 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of segmentation speed (img/s) across three datasets (ADE20K, SUIM, and RumexWeeds). Each plot shows results for a different segmentation head: Segmenter (top) and UPerNet (bottom). For each dataset, we compare three models: ViT, CTS, and our model. Across both heads and all datasets, our model consistently achieves the highest image throughput. The improvements are most pronounced for datasets with few subjects that are dominated by background (see ablation study).
  • Figure 2: Examples from the ADE20K (top), SUIM (middle), and RumexWeeds (bottom) datasets. Columns: (a) Input image, (b) Ground truth semantic segmentation, (c) Model prediction, (d) Mask for the token clustering generated from the ground truth, (e) Predicted cluster for each token. Starting from the output of (e), regions with the same non-black color belong to the same cluster and get merged into a single token for the subsequent Transformer layers, while tokens in black regions are kept intact to preserve fine details. Our model configuration used to obtain these results is $\text{ClustViT-b}_{k3, ip3}$.
  • Figure 3: ClustViT overview. The standard Transformer pipeline is executed (center, from bottom to top) through the tokenizer and a few Transformer blocks until the Cluster module is encountered. The Transformer backbone then proceeds with a reduced number of tokens. Before being passed to the segmentation head, the tokens are reconstructed by the Regenerator module. Cluster module (left): An MLP predicts the probability of each token belonging to a cluster. Tokens of the same cluster (color coded) are grouped, while unclustered (gray) tokens are kept intact. The tokens within each group are aggregated into a single representative token, and the reduced token set (cluster representatives + unclustered tokens + CLS) is fed through the remaining Transformer blocks, lowering compute. Regenerator module (right): The module takes the reduced sequence and uses the stored assignments to expand each representative back to its original token positions. An MLP refines the reinstated per-token features. The reconstructed full-resolution tokens are combined with the preserved unclustered tokens, and the restored sequence is delivered to the segmentation head.
  • Figure 4: Distribution of token counts and class diversity across test sets. Each row shows the histogram of tokens used by $\text{ClustViT-b}_{k3, ip3}$ (left) and the average number of classes per image (right). ADE20K, a dataset with high class diversity, exhibits a symmetric token distribution; SUIM, with moderate diversity, is moderately left-skewed; RumexWeeds, composed of images with low class diversity, is sharply peaked.
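
The merge-and-restore flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names `merge_tokens` and `regenerate_tokens` are hypothetical, cluster assignments are given directly instead of being predicted by an MLP, the mean is assumed as the aggregation operator (the caption only says tokens are "aggregated"), and the CLS token and the Regenerator's refinement MLP are omitted.

```python
import numpy as np


def merge_tokens(tokens: np.ndarray, assignments: np.ndarray):
    """Merge clustered tokens into one representative per cluster.

    tokens:      (N, D) float array of token features.
    assignments: (N,) int array; 0 means "keep intact", k > 0 is a cluster id.
    Returns the reduced sequence (cluster representatives first, then the
    untouched tokens) and the sorted cluster ids that were merged.
    """
    cluster_ids = np.unique(assignments[assignments > 0])
    reps = np.zeros((len(cluster_ids), tokens.shape[1]), dtype=tokens.dtype)
    for i, c in enumerate(cluster_ids):
        # Assumption: mean pooling as the aggregation operator.
        reps[i] = tokens[assignments == c].mean(axis=0)
    kept = tokens[assignments == 0]
    return np.concatenate([reps, kept], axis=0), cluster_ids


def regenerate_tokens(reduced: np.ndarray, assignments: np.ndarray,
                      cluster_ids: np.ndarray) -> np.ndarray:
    """Expand the reduced sequence back to N tokens using stored assignments."""
    full = np.empty((assignments.shape[0], reduced.shape[1]), dtype=reduced.dtype)
    for i, c in enumerate(cluster_ids):
        # Every position of a cluster receives a copy of its representative.
        full[assignments == c] = reduced[i]
    # Unclustered tokens return to their original positions, in order.
    full[assignments == 0] = reduced[len(cluster_ids):]
    return full


# Toy example: 6 tokens of width 2, two clusters, one token kept intact.
tokens = np.arange(12, dtype=np.float32).reshape(6, 2)
assignments = np.array([1, 1, 0, 2, 2, 2])

reduced, ids = merge_tokens(tokens, assignments)
restored = regenerate_tokens(reduced, assignments, ids)

print(reduced.shape)   # (3, 2): two representatives plus one kept token
print(restored.shape)  # (6, 2): back to the original sequence length
```

In this sketch the Transformer blocks that would run between the two calls see only 3 tokens instead of 6. The restored sequence has the original length, but merged positions share one feature vector until a refinement step (the MLP in the caption) differentiates them again.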