Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li

TL;DR

This paper addresses open-vocabulary semantic segmentation by extracting accurate object masks from off-the-shelf vision-language models without any additional training. The proposed Plug-and-Play OVSS (PnP-OVSS) combines cross-attention maps, GradCAM-style sharpening using the ITM loss, and Salience DropOut to progressively reveal complete object extents, with Gaussian blur and Dense CRF for refinement. Hyperparameters are tuned via a CLIP-based weak reward, enabling zero-shot optimization without dense pixel annotations. Empirically, PnP-OVSS achieves substantial gains over training-free baselines across multiple datasets and backbones, highlighting a scalable direction for OVSS that leverages existing VLMs against open vocabularies.

Abstract

From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS.

Paper Structure (23 sections, 4 equations, 6 figures, 8 tables)

This paper contains 23 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Qualitative Results of PnP-OVSS + BLIP. Images are from Pascal VOC and COCO Object. The right columns and bottom rows show the ground-truth (GT); the rest are our results. Note the good results on small objects like the frisbee and the tennis racket.
  • Figure 2: Segmentation masks for elephant and giraffe using (a) off-the-shelf cross-attention, (b) cross-attention + GradCAM, and (c) cross-attention + GradCAM + Salience DropOut (§ \ref{sec:salience-dropout}). The naive cross-attention masks are too inclusive whereas the GradCAM masks are too exclusive.
  • Figure 3: The first iteration with cross-attention + GradCAM + Salience DropOut. The text prompt contains $K$ class names and the image contains $P\times P$ patches. From a cross-attention layer and an attention head in the pretrained VLM, we obtain $K$ attention score maps of size $P\times P$, which are sharpened by GradCAM using gradients from the image-text-matching (ITM) loss. To get more complete predictions, we perform Salience DropOut, which repeatedly zeroes out the image patches with the highest average scores and feeds the remaining patches to the image encoder again, forcing the model to attend to other, less discriminative patches. We show example salience maps from all iterations in Fig. \ref{fig:short-b}.
  • Figure 4: An illustration of Salience DropOut, showing GradCAM salience values after each iteration. Black squares in the images indicate dropped patches. We obtain the final result by summing the salience maps from all iterations and applying thresholding, Gaussian blur, and Dense CRF.
  • Figure 5: PnP-OVSS+BLIP$_{\text{Flickr}}$ segmentation results for in-the-wild images.
  • ...and 1 more figures
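
The Salience DropOut loop described in the captions of Figures 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for the VLM's GradCAM-sharpened cross-attention (it takes a keep-mask over patches and returns $K$ flattened score maps), and `num_iters` and `drop_frac` are placeholder hyperparameters whose values are not given in this summary. The thresholding, Gaussian blur, and Dense CRF steps are omitted.

```python
from typing import Callable, List

# Hypothetical interface: given a keep-mask over patches, return K score maps,
# each a flat list with one score per patch (P*P entries).
ScoreFn = Callable[[List[bool]], List[List[float]]]


def salience_dropout(score_fn: ScoreFn, num_patches: int,
                     num_iters: int = 3, drop_frac: float = 0.5) -> List[List[float]]:
    """Sum per-class salience maps over iterations, dropping top patches each time."""
    keep = [True] * num_patches
    total: List[List[float]] = []
    for _ in range(num_iters):
        maps = score_fn(keep)  # K maps computed on the remaining patches
        if not total:
            total = [[0.0] * num_patches for _ in maps]
        # Accumulate salience only for patches that are still visible.
        for k, class_map in enumerate(maps):
            for p in range(num_patches):
                if keep[p]:
                    total[k][p] += class_map[p]
        # Rank remaining patches by their score averaged over all K classes.
        alive = [p for p in range(num_patches) if keep[p]]
        avg = {p: sum(m[p] for m in maps) / len(maps) for p in alive}
        n_drop = int(len(alive) * drop_frac)  # assumed drop rule
        for p in sorted(alive, key=lambda q: avg[q], reverse=True)[:n_drop]:
            keep[p] = False  # zero out the most salient patches
    return total


def toy_score_fn(keep: List[bool]) -> List[List[float]]:
    """Toy stand-in for the VLM: fixed scores, zeroed where patches are dropped."""
    base = [[0.9, 0.6, 0.1, 0.0],   # class 0
            [0.0, 0.1, 0.5, 0.8]]   # class 1
    return [[s if keep[p] else 0.0 for p, s in enumerate(row)] for row in base]


total = salience_dropout(toy_score_fn, num_patches=4, num_iters=2)
print(total)
```

In the toy run, the two patches with the highest class-averaged scores are dropped after the first pass, so the second pass adds salience only for the remaining, less discriminative patches. A real VLM would also redistribute its attention once the dominant patches are removed, which a fixed score table cannot reproduce.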