A Lightweight Clustering Framework for Unsupervised Semantic Segmentation

Yau Shing Jonathan Cheung, Xi Chen, Lihe Yang, Hengshuang Zhao

TL;DR

This work tackles unsupervised semantic segmentation by introducing LightCluster, a network-free framework that leverages foreground-background differentiability in self-supervised Vision Transformer attention features. It performs multilevel clustering at Dataset-level, Category-level, and Image-level to generate high-quality patch-level pseudo-masks, followed by upsampling, refinement, and CLS-token-based class assignment. The approach achieves state-of-the-art unsupervised results on PASCAL VOC and MS COCO while dramatically reducing computational requirements by avoiding model training. The paper also provides a thorough analysis comparing DINO and DINOv2 features and demonstrates the practical advantages of attention-based clustering for efficient, scalable segmentation.

Abstract

Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data. It is a widely researched area as obtaining labeled datasets is expensive. While previous works in the field have demonstrated a gradual improvement in model accuracy, most required neural network training. This made segmentation equally expensive, especially when dealing with large-scale datasets. We thus propose a lightweight clustering framework for unsupervised semantic segmentation. We discovered that attention features of the self-supervised Vision Transformer exhibit strong foreground-background differentiability. Therefore, clustering can be employed to effectively separate foreground and background image patches. In our framework, we first perform multilevel clustering across the Dataset-level, Category-level, and Image-level, and maintain consistency throughout. Then, the binary patch-level pseudo-masks extracted are upsampled, refined and finally labeled. Furthermore, we provide a comprehensive analysis of the self-supervised Vision Transformer features and a detailed comparison between DINO and DINOv2 to justify our claims. Our framework demonstrates great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
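
The clustering and upsampling steps described above can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration: the 2-D toy features stand in for real Vision Transformer attention features, the helper names `kmeans` and `upsample_nearest` are hypothetical, the rule "the top-left patch is background" is a toy shortcut rather than the paper's multilevel consistency, and the refinement step is omitted.

```python
import math


def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: mean of each cluster (an empty cluster keeps its center).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def upsample_nearest(mask, factor):
    """Nearest-neighbour upsampling of a 2-D patch mask to a finer grid."""
    return [[mask[r // factor][c // factor]
             for c in range(len(mask[0]) * factor)]
            for r in range(len(mask) * factor)]


# Toy stand-in for per-patch features on a 4x4 patch grid (ASSUMPTION:
# synthetic values; the central 2x2 patches play the role of the object).
GRID = 4
features = []
for idx in range(GRID * GRID):
    r, c = divmod(idx, GRID)
    jitter = 0.01 * (idx % 3)
    if r in (1, 2) and c in (1, 2):
        features.append([1.0 + jitter, 0.9 - jitter])
    else:
        features.append([0.0 + jitter, 0.1 + jitter])

labels = kmeans(features, k=2)

# ASSUMPTION (toy only): the cluster containing the top-left patch is background.
background_id = labels[0]
patch_mask = [[int(labels[r * GRID + c] != background_id) for c in range(GRID)]
              for r in range(GRID)]
pixel_mask = upsample_nearest(patch_mask, factor=2)

for row in patch_mask:
    print(row)
```

On real data the same two steps would typically be run with an optimised library implementation of k-means over the extracted features; the pure-Python version here only keeps the example self-contained.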

Paper Structure (26 sections, 3 equations, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: With simple clustering of attention features, we can obtain accurate pseudo-mask predictions. Dataset-level, Category-level, and Image-level masks are extracted by clustering features within the same dataset, superclass, and image, respectively. Masks at different levels each have their own strengths and weaknesses. In simple backgrounds, all masks deliver accurate predictions. However, segmentation becomes more challenging with complex backgrounds. Both Dataset-level and Category-level masks provide a precise estimation of object structure, whereas Image-level masks, though the coarsest, can be used to 1) identify the foreground class in Dataset-level and Category-level masks and 2) remove noise. By ensuring multilevel clustering consistency, we can obtain high-quality patch-level binary pseudo-masks that are ready for post-processing.
  • Figure 2: Illustration of our Lightweight Clustering Framework. We first utilise the self-supervised Vision Transformer to extract image patch features. Then, we perform clustering at the Image-level, Category-level, and Dataset-level. We further ensure multilevel clustering consistency and extract the binary patch-level pseudo-mask. The mask is then upsampled and refined accordingly. Finally, object regions are cropped and clustered into their respective classes.
  • Figure 3: PCA visualization of 'Key' attention features of the PASCAL VOC validation dataset. We first showcase the Principal Component Analysis visualization of 'Key' attention features corresponding to foreground and background patches from the ground-truth PASCAL VOC validation dataset. Blue corresponds to foreground image patches, while red represents the background. Additionally, we present the results achieved by clustering the attention features into two, three, and four clusters. Finally, we display the feature distribution achieved through our multilevel clustering framework.
  • Figure 4: Qualitative results on the PASCAL VOC 2012 dataset. We display the segmentation results of our framework in comparison with the current state-of-the-art method COMUS after two rounds of self-training.
  • Figure 5: Qualitative results using patch tokens for class assignments on the PASCAL VOC 2012 val dataset. We present visualizations of clustering foreground patch tokens and 'Key' attention features into 20 classes. The results demonstrate that clustering based on patch tokens can accurately identify object parts, while clustering using attention features is unable to do so. The discovered object parts include human heads (orange), ears of animals (yellow), mouths of animals (blue), and legs/hands of animals (green). The bodies of horses are grouped into the pink cluster, whereas the bodies of other animals, which share a similar texture, are categorized into the purple cluster.
  • ...and 4 more figures
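
The Figure 1 caption says that the coarse Image-level mask can be used to identify the foreground class in a finer mask and to remove noise. One way this could work is sketched below. The voting threshold, the intersection rule, and the function name `consistent_mask` are assumptions made for illustration and are not taken from the paper; the two input grids are hand-written toy data.

```python
def consistent_mask(dataset_labels, image_mask, vote=0.5):
    """Combine fine cluster ids with a coarse binary mask (illustrative rule).

    1) A cluster id counts as foreground if more than `vote` of its patches
       fall inside the coarse Image-level foreground.
    2) A patch is kept only if its cluster is foreground AND the coarse mask
       also marks it as foreground, which drops isolated noise patches.
    """
    total, inside = {}, {}
    for lab_row, fg_row in zip(dataset_labels, image_mask):
        for lab, fg in zip(lab_row, fg_row):
            total[lab] = total.get(lab, 0) + 1
            inside[lab] = inside.get(lab, 0) + fg
    fg_ids = {lab for lab in total if inside[lab] / total[lab] > vote}
    mask = [[int(lab in fg_ids and fg == 1) for lab, fg in zip(lab_row, fg_row)]
            for lab_row, fg_row in zip(dataset_labels, image_mask)]
    return mask, fg_ids


# Toy Dataset-level cluster ids: cluster 1 covers the object (centre 2x2)
# plus one noisy patch in the top-right corner.
dataset_labels = [
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
# Toy Image-level mask: coarse, over-covers the object but has no stray patch.
image_mask = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

final_mask, fg_ids = consistent_mask(dataset_labels, image_mask)
for row in final_mask:
    print(row)
```

In this toy case cluster 1 is voted foreground because four of its five patches lie inside the coarse mask, and the stray corner patch is discarded because the coarse mask does not cover it, so the result keeps the precise object outline from the finer mask.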