Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels

Heeseong Shin; Chaehyun Kim; Sunghwan Hong; Seokju Cho; Anurag Arnab; Paul Hongsuck Seo; Seungryong Kim

TL;DR

PixelCLIP tackles open-vocabulary semantic segmentation without semantic labels by fine-tuning the CLIP image encoder using unlabeled masks from the vision foundation models SAM and DINO. It introduces global semantic clustering of masks with learnable class prompts and uses a momentum encoder to stabilize training, achieving an average gain of +16.2 mIoU over CLIP and results competitive with caption-supervised methods. The method demonstrates strong open-vocabulary segmentation and zero-shot mask classification, and the paper provides extensive ablations and qualitative analyses that validate the effectiveness of unlabeled mask supervision and prompt learnability. This approach enables dense, open-set recognition with existing CLIP-based frameworks and offers a scalable path toward reducing annotation costs in segmentation tasks.
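
The summary above does not state how the momentum encoder is updated. A minimal sketch, assuming the common exponential-moving-average (EMA) formulation in which the momentum encoder slowly tracks the encoder being trained, might look as follows. The function name `ema_update`, the parameter dictionaries, and the momentum value `m` are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ema_update(momentum_params, online_params, m=0.999):
    """Return momentum-encoder parameters after one EMA step.

    Both arguments are dicts mapping (hypothetical) parameter names to
    arrays of matching shape. The default m is an assumed placeholder,
    not a value reported in the paper.
    """
    return {
        name: m * momentum_params[name] + (1.0 - m) * online_params[name]
        for name in momentum_params
    }

# Toy usage: the momentum copy moves a small step toward the online weights.
online = {"w": np.ones(3)}
momentum = {"w": np.zeros(3)}
momentum = ema_update(momentum, online, m=0.9)
```

Under this assumption, a value of `m` close to 1 makes the momentum encoder change slowly, which is the usual reason such encoders give more stable training targets.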

Abstract

Large-scale vision-language models like CLIP have demonstrated impressive open-vocabulary capabilities for image-level tasks, excelling in recognizing what objects are present. However, they struggle with pixel-level recognition tasks like semantic segmentation, which additionally require understanding where the objects are located. In this work, we propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding by guiding the model on where, which is achieved using unlabeled images and masks generated from vision foundation models such as SAM and DINO. To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm using learnable class names to acquire general semantic concepts. PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods in open-vocabulary semantic segmentation. Project page is available at https://cvlab-kaist.github.io/PixelCLIP
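
The abstract does not spell out the online clustering procedure. As a rough illustration only, the sketch below assumes a Sinkhorn-Knopp style balanced assignment, which is one common way to cluster features online, applied to similarities between per-mask features and the text features of learnable class prompts. The random arrays stand in for real features, and every name here (`sinkhorn_assign`, `mask_feats`, `prompt_feats`, `eps`, `n_iters`) is hypothetical.

```python
import numpy as np

def sinkhorn_assign(scores, eps=0.05, n_iters=3):
    """Balanced soft assignment of N masks to K clusters (Sinkhorn-Knopp).

    scores: (N, K) similarities between mask features and prompt features.
    Returns an (N, K) matrix whose rows each sum to 1.
    eps and n_iters are assumed values, not taken from the paper.
    """
    q = np.exp((scores - scores.max()) / eps).T  # (K, N), shifted for stability
    q /= q.sum()
    k, n = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True)  # equalize total mass per cluster
        q /= k
        q /= q.sum(axis=0, keepdims=True)  # equalize total mass per mask
        q /= n
    return (q * n).T

# Stand-ins for pooled mask features and learnable class-prompt text features.
rng = np.random.default_rng(0)
mask_feats = rng.normal(size=(8, 16))
prompt_feats = rng.normal(size=(4, 16))
mask_feats /= np.linalg.norm(mask_feats, axis=1, keepdims=True)
prompt_feats /= np.linalg.norm(prompt_feats, axis=1, keepdims=True)

scores = mask_feats @ prompt_feats.T      # cosine similarities, shape (8, 4)
assignments = sinkhorn_assign(scores)     # soft cluster assignment per mask
pseudo_labels = assignments.argmax(axis=1)
```

In such a setup, the resulting assignments could serve as pseudo-labels for the unlabeled masks. The balancing step is a design choice that discourages all masks from collapsing into a single cluster.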

Paper Structure (38 sections, 5 equations, 9 figures, 6 tables)

Introduction
Related Work
Open-vocabulary semantic segmentation
Fine-tuning vision-language models for dense prediction
Vision foundation models
Methodology
Preliminaries
Integrating masks into CLIP features
Semantic clustering of masks
Online clustering via learnable class prompts.
Momentum encoder for integrating mask features.
Experiments
Implementation details
Experimental setting
Results
...and 23 more sections

Figures (9)

Figure 1: Illustration of different approaches for open-vocabulary semantic segmentation. In contrast to existing methods utilizing (a) pixel-level semantic labels [ding2022decoupling, ghiasi2022scaling, xu2022simple, xu2023side, xu2023open, cho2024catseg] or (b) image-level semantic labels [cha2023learning, mukhoti2023open, luo2023segclip, wang2023sam, xu2022groupvit, liu2022open], we leverage unlabeled masks as supervision, which can be freely generated from vision foundation models such as SAM [kirillov2023segment] and DINO [caron2021emerging].
Figure 2: Visualization of masks from vision foundation models. We visualize the masks generated by SAM [kirillov2023segment] and by clustering image features from DINO [caron2021emerging]. Although such models can freely generate fine-grained masks, the resulting masks can be too small or incomplete to carry semantic meaning. To address this over-segmentation issue, we employ online clustering [caron2020unsupervised] to group the masks into semantically meaningful clusters defined globally for the given images.
Figure 3: Illustration of our overall framework. PixelCLIP fine-tunes the image encoder of CLIP using unlabeled images and masks, enabling open-vocabulary semantic segmentation. The momentum image encoder and the mask decoder are used only during training; inference uses only the image and text encoders of CLIP.
Figure 4: Comparison between PixelCLIP and CLIP. We provide a qualitative comparison of PixelCLIP and CLIP on the ADE-20K [zhou2019semantic] dataset. Fine-tuning gives PixelCLIP dense visual recognition capabilities, whereas the predictions of CLIP are significantly noisy.
Figure 5: Visualization of learned class prompts. In (a-b), we show $t$-SNE visualizations of the text features from our learned class prompts together with the text features from the class names of COCO-Stuff. In (c-d), we show inference results obtained with the learned class prompts.
...and 4 more figures
