Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim

TL;DR

PixelCLIP tackles open-vocabulary semantic segmentation without semantic labels by fine-tuning the CLIP image encoder using unlabeled masks from vision foundation models SAM and DINO. It introduces global semantic clustering of masks with learnable class prompts and uses a momentum encoder to stabilize training, achieving an average of +16.2 mIoU over CLIP and competitive results with caption-supervised methods. The method demonstrates strong open-vocabulary segmentation and zero-shot mask classification, while providing extensive ablations and qualitative analyses that validate the effectiveness of unlabeled mask supervision and prompt learnability. This approach enables dense, open-set recognition with existing CLIP-based frameworks and offers a scalable path toward reducing annotation costs in segmentation tasks.
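
As a rough illustration of what a momentum encoder usually involves, the sketch below applies an exponential moving average (EMA) to a flat list of parameters. The function name `ema_update`, the decay value, and the list-of-floats representation are assumptions made for this example; the summary above does not specify how PixelCLIP implements its momentum update.

```python
def ema_update(momentum_params, online_params, decay=0.999):
    """Move each momentum parameter a small step toward its online counterpart.

    Hypothetical helper: parameters are plain floats here, whereas a real
    encoder would hold tensors. The decay value is illustrative only.
    """
    if len(momentum_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [
        decay * m + (1.0 - decay) * p
        for m, p in zip(momentum_params, online_params)
    ]


# Toy usage: the momentum copy lags behind the online parameters.
online = [1.0, 2.0]
momentum = [0.0, 0.0]
momentum = ema_update(momentum, online, decay=0.9)
```

Because the momentum copy changes slowly, targets computed from it drift less between training steps than targets computed from the encoder being fine-tuned, which is the usual motivation for this kind of stabilization.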

Abstract

Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding by guiding the model on where, which is achieved using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm using learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods in open-vocabulary semantic segmentation. Project page is available at https://cvlab-kaist.github.io/PixelCLIP
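
To make the idea of grouping unlabeled masks by learnable class names more concrete, the following is a minimal sketch under strong simplifying assumptions: per-pixel features are averaged inside a mask, and the mask is assigned to the most similar prompt embedding by cosine similarity. The helper names (`masked_average`, `cosine`, `assign_cluster`), the toy 2-D features, and the hard argmax assignment are all hypothetical; the online clustering algorithm described in the abstract is not specified here and may differ substantially.

```python
import math


def masked_average(features, mask):
    """Average the per-pixel feature vectors at positions where mask is 1."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no pixels")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def assign_cluster(mask_feature, prompt_features):
    """Return the index of the prompt embedding most similar to the mask."""
    scores = [cosine(mask_feature, p) for p in prompt_features]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (assumed): four pixels with 2-D features, one mask, two prompts.
pixel_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mask = [1, 1, 0, 0]
prompts = [[0.0, 1.0], [1.0, 0.0]]

pooled = masked_average(pixel_features, mask)
cluster = assign_cluster(pooled, prompts)
```

In this toy case the mask covers the first two pixels, so its pooled feature points mostly along the first axis and is assigned to the second prompt (index 1).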

Paper Structure

This paper contains 38 sections, 5 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: Illustration of different approaches for open-vocabulary semantic segmentation. In contrast to existing methods that use (a) pixel-level semantic labels [ding2022decoupling, ghiasi2022scaling, xu2022simple, xu2023side, xu2023open, cho2024catseg] or (b) image-level semantic labels [cha2023learning, mukhoti2023open, luo2023segclip, wang2023sam, xu2022groupvit, liu2022open], we leverage unlabeled masks as supervision, which can be freely generated from vision foundation models such as SAM [kirillov2023segment] and DINO [caron2021emerging].
  • Figure 2: Visualization of masks from vision foundation models. We visualize the masks generated by SAM [kirillov2023segment] and by clustering image features from DINO [caron2021emerging]. Although such models can freely generate fine-grained masks, the resulting masks can be too small or incomplete to carry semantic meaning. To address this over-segmentation issue, we employ online clustering [caron2020unsupervised] to group the masks into semantically meaningful clusters defined globally across the given images.
  • Figure 3: Illustration of our overall framework. PixelCLIP uses unlabeled images and masks to fine-tune the image encoder of CLIP, enabling open-vocabulary semantic segmentation. The momentum image encoder and the mask decoder are used only during training; inference uses only the image and text encoders of CLIP.
  • Figure 4: Comparison between PixelCLIP and CLIP. We provide a qualitative comparison of PixelCLIP and CLIP on the ADE-20K [zhou2019semantic] dataset. PixelCLIP demonstrates the dense visual recognition capabilities gained from fine-tuning CLIP, whereas CLIP produces results with significant noise.
  • Figure 5: Visualization of learned class prompts. In (a-b), we visualize the text features from our learned class prompts, together with text features from the class names of COCO-Stuff, using $t$-SNE. In (c-d), we visualize inference results on images using the learned class prompts.
  • ...and 4 more figures