ClustViT: Clustering-based Token Merging for Semantic Segmentation
Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis
TL;DR
ClustViT introduces a semantics-guided token clustering mechanism within a ViT backbone to reduce computation for semantic segmentation. A Cluster module merges semantically similar tokens between Transformer layers, guided by pseudo-clusters derived from segmentation masks, while a Regenerator reconstructs full token representations for downstream heads. Across ADE20K, SUIM, and RumexWeeds, ClustViT achieves up to 2.18x fewer GFLOPs and up to 1.64x faster inference, with comparable accuracy or only modest accuracy changes, especially in background-dominated scenes typical of robotics. The approach offers a practical path to deploying efficient ViT-based segmentation on real-world robotic systems by balancing semantic compression against reconstruction fidelity, and it can be paired with standard segmentation heads such as Segmenter or UPerNet.
Abstract
Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited by the quadratic complexity of attention in the number of tokens. Recent works have focused on dynamically merging tokens according to image complexity. Token merging works well for classification but is less suited to dense prediction. We propose ClustViT, which extends the Vision Transformer (ViT) backbone to address semantic segmentation. Within our architecture, a trainable Cluster module merges similar tokens along the network, guided by pseudo-clusters derived from segmentation masks. Subsequently, a Regenerator module restores fine details for downstream heads. Our approach achieves up to 2.18x fewer GFLOPs and 1.64x faster inference on three different datasets, with comparable segmentation accuracy. Our code and models will be made publicly available.
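
The merge-then-restore idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (merge_tokens, regenerate_tokens, attention_pairs) are hypothetical, cluster assignments are passed in by hand rather than predicted by a trainable Cluster module, merging is assumed to be a plain average, and the "regeneration" here is a simple copy of each cluster's merged token back to its original positions, whereas the paper's Regenerator is a learned module that restores fine details.

```python
def merge_tokens(tokens, cluster_ids):
    """Average all tokens that share a cluster id (assumed merge rule)."""
    sums, counts = {}, {}
    for tok, cid in zip(tokens, cluster_ids):
        if cid not in sums:
            sums[cid] = [0.0] * len(tok)
            counts[cid] = 0
        sums[cid] = [s + t for s, t in zip(sums[cid], tok)]
        counts[cid] += 1
    order = list(sums)  # cluster ids in first-seen order
    merged = [[s / counts[cid] for s in sums[cid]] for cid in order]
    return merged, order


def regenerate_tokens(merged, order, cluster_ids):
    """Copy each merged token back to every position of its cluster.

    A stand-in for a learned regenerator: it restores the sequence
    length, not the per-token detail.
    """
    slot = {cid: i for i, cid in enumerate(order)}
    return [list(merged[slot[cid]]) for cid in cluster_ids]


def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention (quadratic)."""
    return num_tokens * num_tokens


# Five 2-D tokens; tokens 0-1 and 2-3 are assumed to belong to the
# same semantic region, token 4 stays on its own.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
cluster_ids = [0, 0, 1, 1, 2]

merged, order = merge_tokens(tokens, cluster_ids)
restored = regenerate_tokens(merged, order, cluster_ids)

print(len(tokens), "->", len(merged), "tokens after merging")
print(attention_pairs(len(tokens)), "->", attention_pairs(len(merged)),
      "attention pairs per layer")
```

Because attention cost grows with the square of the token count, the layers that operate on the merged sequence handle fewer query-key pairs, which is why savings are expected to be largest when many tokens belong to a few large regions such as background.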
