Exploring Simple Open-Vocabulary Semantic Segmentation

Zihang Lai

TL;DR

This work tackles open-vocabulary semantic segmentation by eliminating dependence on manually annotated masks or CLIP-style pretraining. It introduces S-Seg, which trains a MaskFormer model using pseudo-masks generated from self-supervised clustering (via DINO features and K-Means) and language supervision from image-text contrastive loss, enabling training solely from publicly available image-text datasets. A key contribution is decoupling mask supervision from language supervision and employing self-training (S-Seg+) to further boost performance, with strong generalization across VOC, Context, COCO, LVIS, and ImageNet-S. The approach yields competitive results, scales well with more data, and provides a simple, robust baseline that reduces reliance on heavy pretraining and extensive labeled masks, offering a practical path for future open-vocabulary segmentation research.
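
As a rough illustration of the pseudo-mask step mentioned above (clustering self-supervised features with K-Means), the sketch below clusters a grid of per-patch feature vectors into `k` segments. It is a minimal NumPy sketch, not the authors' implementation: the function name `kmeans_pseudo_mask`, the L2 normalization, the random initialization, and the fixed iteration count are all assumptions made for illustration, and the input array merely stands in for DINO patch features.

```python
import numpy as np

def kmeans_pseudo_mask(features, k, n_iters=10, seed=0):
    """Cluster per-patch features into k segments (hypothetical helper).

    features: (H, W, D) array standing in for self-supervised patch features.
    Returns an (H, W) integer array of cluster ids, i.e. a pseudo-mask.
    """
    h, w, d = features.shape
    x = features.reshape(-1, d).astype(np.float64)
    # Assumption: L2-normalize so Euclidean distance tracks cosine similarity.
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # Assumption: initialize centers from k distinct random patches.
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]

    def assign(points, centers):
        # Squared distance from every patch to every center, then nearest center.
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(x, centers)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)

    return assign(x, centers).reshape(h, w)
```

In a full pipeline, the resulting cluster map would be resized to image resolution and used as the target for the mask predictions; those steps are outside this sketch.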

Abstract

Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. To learn such pixel-level alignment, current approaches typically rely on a combination of (i) an image-level VL model (e.g., CLIP), (ii) ground-truth masks, and (iii) custom grouping encoders. In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. Contrary to prior works, our model directly trains for pixel-level feature and language alignment. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg scales with data and improves consistently when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research.

Paper Structure (27 sections, 4 equations, 18 figures, 10 tables)

Figures (18)

  • Figure 1: S-Seg result on a web image. Our goal is to segment everything, including fictional characters like minions.
  • Figure 2: Our S-Seg framework leverages pseudo-masks and language to train a MaskFormer. We show that our method of directly training for pixel-level feature and language alignment yields superior results.
  • Figure 3: Qualitative results of S-Seg, evaluated using all dataset classes as queries. Our model copes with challenging situations, such as overlapping objects (col. 2) and small objects (col. 5). Our model is also capable of handling "stuff" categories such as water and floor (col. 3, 4). Moreover, our S-Seg+ model is able to correct small errors observed in the S-Seg method (col. 4). Finally, on the COCO dataset, which features a significantly higher number of objects, our model still achieves high accuracy in its predictions.
  • Figure 4: Pseudocode for training S-Seg with image-text pairs.
  • Figure 5: Overview of S-Seg. A MaskFormer model computes masks and mask features from an image input. A pseudo-mask generator produces segmentation maps to supervise mask predictions, while a text that describes the image, encoded by a language model trained together with the MaskFormer, provides supervision for mask features using image-text contrastive loss.
  • ...and 13 more figures
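
The image-text contrastive supervision named in the Figure 5 caption can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's code: it assumes each image has already been reduced to a single embedding (how mask features are pooled is not specified in this excerpt), uses a symmetric InfoNCE-style loss, and the function names and the temperature value of 0.07 are hypothetical choices.

```python
import numpy as np

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def image_text_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch (hypothetical helper).

    image_emb, text_emb: (B, D) arrays; row i of each forms a matched pair.
    """
    # Cosine similarity between every image and every text in the batch.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    # Matched pairs sit on the diagonal; score them in both directions.
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is low when each image embedding is most similar to its own caption within the batch and high when captions are mismatched, which is the signal that pulls visual features toward the text embedding space.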