A Lightweight Clustering Framework for Unsupervised Semantic Segmentation
Yau Shing Jonathan Cheung, Xi Chen, Lihe Yang, Hengshuang Zhao
TL;DR
This work tackles unsupervised semantic segmentation by introducing LightCluster, a training-free framework that exploits the foreground-background differentiability of self-supervised Vision Transformer attention features. It performs multilevel clustering at the Dataset-level, Category-level, and Image-level to generate high-quality patch-level pseudo-masks, which are then upsampled, refined, and assigned a class using the CLS token. The approach achieves state-of-the-art unsupervised results on PASCAL VOC and MS COCO while reducing computational cost by avoiding model training. The paper also analyzes and compares DINO and DINOv2 features and demonstrates the practical advantages of attention-based clustering for efficient, scalable segmentation.
Abstract
Unsupervised semantic segmentation aims to assign each pixel in an image to a class without using annotated data. It is a widely researched area because labeled datasets are expensive to obtain. While previous works have gradually improved model accuracy, most require training a neural network, which makes segmentation itself expensive, especially on large-scale datasets. We therefore propose a lightweight clustering framework for unsupervised semantic segmentation. We observe that attention features of the self-supervised Vision Transformer exhibit strong foreground-background differentiability, so clustering can effectively separate foreground and background image patches. In our framework, we first perform multilevel clustering at the Dataset-level, Category-level, and Image-level, maintaining consistency throughout. The extracted binary patch-level pseudo-masks are then upsampled, refined, and finally labeled. Furthermore, we provide a comprehensive analysis of the self-supervised Vision Transformer features and a detailed comparison between DINO and DINOv2 to justify our claims. Our framework achieves state-of-the-art results on the PASCAL VOC and MS COCO datasets.
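
The central idea of the abstract, clustering patch features into foreground and background and then upsampling the patch-level mask, can be illustrated with a minimal sketch. This is not the authors' implementation: the features below are synthetic stand-ins for ViT attention features, only a single image-level clustering step is shown, and the function names (`two_means`, `patch_mask`, `upsample_nearest`) are hypothetical. The rule for deciding which cluster is background (the one that dominates the image border) is likewise an assumption made for illustration, and nearest-neighbor upsampling stands in for the upsampling and refinement stages.

```python
import numpy as np

def two_means(features, n_iter=20):
    """Plain 2-means (Lloyd's algorithm) on an (n, d) array; returns labels in {0, 1}."""
    # Deterministic init: the first point and the point farthest from it.
    c0 = features[0]
    c1 = features[np.argmax(np.linalg.norm(features - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    labels = np.zeros(len(features), dtype=int)
    for _ in range(n_iter):
        # Distance of every patch feature to both centers: shape (n, 2).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = features[labels == k].mean(axis=0)
    return labels

def patch_mask(features, grid_h, grid_w):
    """Binary patch-level mask (1 = foreground) from per-patch features."""
    labels = two_means(features).reshape(grid_h, grid_w)
    border = np.concatenate(
        [labels[0], labels[-1], labels[1:-1, 0], labels[1:-1, -1]]
    )
    # Assumption for this sketch: the cluster dominating the border is background.
    background = int(border.mean() > 0.5)
    return (labels != background).astype(np.uint8)

def upsample_nearest(mask, patch_size):
    """Nearest-neighbor upsampling of a patch mask to pixel resolution."""
    return np.kron(mask, np.ones((patch_size, patch_size), dtype=mask.dtype))

# Synthetic stand-in for ViT patch features: a 6x6 patch grid with 8-dim features,
# where a 2x2 central block is shifted away from the rest.
rng = np.random.default_rng(0)
feats = rng.normal(0.0, 0.1, size=(36, 8))
fg = np.zeros((6, 6), dtype=bool)
fg[2:4, 2:4] = True
feats[fg.ravel()] += 3.0

mask = patch_mask(feats, 6, 6)                     # patch-level pseudo-mask
pixel_mask = upsample_nearest(mask, patch_size=16) # pixel-level mask
```

In a real pipeline, `feats` would come from a self-supervised ViT such as DINO, and the clustering would be applied consistently across the dataset, category, and image levels as described above.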
