Exploring Simple Open-Vocabulary Semantic Segmentation
Zihang Lai
TL;DR
This work tackles open-vocabulary semantic segmentation by eliminating dependence on manually annotated masks or CLIP-style pretraining. It introduces S-Seg, which trains a MaskFormer model using pseudo-masks generated from self-supervised clustering (via DINO features and K-Means) and language supervision from image-text contrastive loss, enabling training solely from publicly available image-text datasets. A key contribution is decoupling mask supervision from language supervision and employing self-training (S-Seg+) to further boost performance, with strong generalization across VOC, Context, COCO, LVIS, and ImageNet-S. The approach yields competitive results, scales well with more data, and provides a simple, robust baseline that reduces reliance on heavy pretraining and extensive labeled masks, offering a practical path for future open-vocabulary segmentation research.
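As a rough illustration of the pseudo-mask step described above, the sketch below clusters a per-pixel feature map with a small hand-written K-Means. It is a minimal sketch under stated assumptions, not the paper's pipeline: the feature map is a synthetic stand-in for DINO features, and the function name `kmeans_pseudo_mask` and its parameters (`k`, `iters`, `seed`) are hypothetical.

```python
import numpy as np

def kmeans_pseudo_mask(features, k, iters=10, seed=0):
    """Cluster an (H, W, D) feature map into k segments.

    Returns an (H, W) integer array of cluster ids (the pseudo-mask).
    Assumption: `features` stands in for self-supervised features such as DINO.
    """
    h, w, d = features.shape
    x = features.reshape(-1, d).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct pixels (fancy indexing returns a copy).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every pixel to every center: (N, k).
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1).reshape(h, w)

# Toy feature map: the left half and the right half carry different features.
toy = np.zeros((4, 6, 2))
toy[:, :3] = [1.0, 0.0]
toy[:, 3:] = [0.0, 1.0]
mask = kmeans_pseudo_mask(toy, k=2)
```

On this toy input the two halves end up in different clusters. The cluster ids themselves are arbitrary and carry no class names, which is why the masks still need language supervision.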
Abstract
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. To learn such pixel-level alignment, current approaches typically rely on a combination of (i) an image-level vision-language (VL) model (e.g., CLIP), (ii) ground-truth masks, and (iii) custom grouping encoders. In this paper, we introduce S-Seg, a novel model that achieves surprisingly strong performance without depending on any of the above elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained from publicly available image-text datasets. In contrast to prior work, our model trains directly for alignment between pixel-level features and language. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg scales with data and improves consistently when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research.
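To make the idea of aligning pixel-level features with language concrete, the sketch below labels each pixel with the text whose embedding has the highest cosine similarity to that pixel's feature. It is a simplified illustration, not S-Seg's MaskFormer inference: the function `label_pixels`, the class names, and the two-dimensional embeddings are all hypothetical values chosen for the example.

```python
import numpy as np

def label_pixels(pixel_feats, text_embeds, eps=1e-8):
    """Assign each pixel the index of its best-matching text embedding.

    pixel_feats: (H, W, D) per-pixel features (assumed to share a space with the text).
    text_embeds: (C, D) one embedding per candidate text.
    Returns an (H, W) integer array of indices into the text list.
    """
    # L2-normalise both sides so the dot product is a cosine similarity.
    p = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=-1, keepdims=True) + eps)
    sims = p @ t.T  # (H, W, C)
    return sims.argmax(axis=-1)

# Hypothetical vocabulary and toy embeddings, for illustration only.
vocab = ["cat", "grass"]
text_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
pixel_feats = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[2.0, 0.5], [0.1, 3.0]],
])
labels = label_pixels(pixel_feats, text_embeds)
named = [[vocab[i] for i in row] for row in labels]
```

Because the vocabulary enters only through `text_embeds`, the candidate texts can be swapped at test time without retraining, which is what makes the setting open-vocabulary.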
