A Simple Framework for Open-Vocabulary Zero-Shot Segmentation

Thomas Stegmüller, Tim Lebailly, Nikola Dukic, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran

TL;DR

This paper tackles open-vocabulary zero-shot segmentation by decoupling visual representation learning from cross-modal alignment and relying on frozen vision backbones that already exhibit spatial awareness. It introduces SimZSS, a simple framework that identifies textual concepts via noun phrases in captions, maps them into the visual space, retrieves the corresponding visual concepts through similarity-based pooling, and enforces both global and concept-level consistency, with the total loss $\mathcal{L}_{\mathrm{tot}} = \mathcal{L}_{\mathrm{g}} + \lambda \mathcal{L}_{\mathrm{l}}$, where $\mathcal{L}_{\mathrm{g}}$ is the global loss, $\mathcal{L}_{\mathrm{l}}$ is the concept-level (local) loss, and $\lambda$ weights the latter. Trained on COCO Captions, the approach achieves state-of-the-art results on 7 of 8 segmentation benchmarks in under 15 minutes on 8 GPUs; it is also robust to both curated and noisy data (LAION-400M) and requires few hyperparameters. Extensive ablations, covering the necessity of a concept bank, backbone choices, and resolution effects, indicate that open-vocabulary segmentation can reach high performance without heavy mask supervision or cross-modal projection, making the approach both data- and compute-efficient for practical deployment.
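
The concept-level part of the objective can be illustrated with a small sketch. It is not the authors' implementation: the softmax weighting over patches, the `temperature` value, the toy 2-D features, and every function name below are assumptions made for illustration, and plain Python lists stand in for the tensors a real training loop would use. The global loss is treated as a given number, since its exact form is not described here.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pooling_weights(patches, text_concept, temperature=0.1):
    # Assumption: patches are weighted by a softmax over cosine similarity
    # to the text concept; the temperature value is hypothetical.
    query = normalize(text_concept)
    sims = [dot(normalize(p), query) / temperature for p in patches]
    return softmax(sims)

def pool_visual_concept(patches, text_concept, temperature=0.1):
    # Similarity-based pooling: weighted average of the patch features.
    weights = pooling_weights(patches, text_concept, temperature)
    dim = len(patches[0])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]

def concept_loss(visual_concept, text_concepts, target_index):
    # Cross-entropy of the pooled visual concept against a linear classifier
    # whose weights are the (normalized) text concept representations.
    logits = [dot(visual_concept, normalize(w)) for w in text_concepts]
    return -math.log(softmax(logits)[target_index])

def total_loss(global_loss, local_loss, lam=1.0):
    # L_tot = L_g + lambda * L_l
    return global_loss + lam * local_loss

# Toy example: four patches, two text concepts (say, "cat" and "grass").
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_concepts = [[1.0, 0.0], [0.0, 1.0]]
pooled = pool_visual_concept(patches, text_concepts[0])
local = concept_loss(pooled, text_concepts, target_index=0)
print(total_loss(global_loss=0.5, local_loss=local, lam=1.0))
```

In this toy setup the pooled feature for the first concept is dominated by the two patches that resemble it, so the cross-entropy is lower for the correct concept index than for the wrong one.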

Abstract

Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pair datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.

Paper Structure (44 sections, 10 equations, 5 figures, 13 tables)

Figures (5)

  • Figure 1: Visualization of the patch-level representations and text concepts in the RGB color space. PCA is used to map the dense representation of a single image into a three-dimensional space. The three-dimensional representations (color) of the text concepts from the concept bank and the corresponding caption are obtained and shown for each image. Each row includes the original image and dense feature visualization at different resolutions. These include the training resolution ($16\times16$) and higher resolutions ($2\times$, $4\times$, and $8\times$).
  • Figure 2: Overview of SimZSS. On the text side (a.), each concept in the caption is represented using a trainable text encoder. On the vision side (c.), visual representations of each concept are obtained via a similarity-based pooling of the visual tokens. These visual concept representations are then projected onto a linear classifier, with weights derived from the text concept representations of the current batch. Cross-modality consistency is enforced using cross-entropy loss (b.).
  • Figure 3: Vision-language alignment of text concepts and dense visual representations. Concepts present in the image are embedded independently by the text encoder and then projected onto the representations of each patch within the image. The images are processed at a resolution of $896\times896$ pixels, corresponding to $4\times$ the training resolution. The alignment is performed on LAION-400M using a ViT-B/14 as the vision tower.
  • Figure 4: Zero-shot segmentation performance as a function of the number of processed image-caption pairs in LAION-400M. The left plot shows the mIoU percentages for different datasets, while the right plot shows the relative performance percentages. Each data point represents the result of running a vision-language alignment from scratch using SimZSS; these are not training curves.
  • Figure 5: Zero-shot classification benchmark. We report the top-1 accuracy of SimZSS with and without the concept bank on 38 evaluation datasets when trained on COCO Captions and LAION-400M. Additional comparison with LiT is provided.
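
The projection described in the Figure 3 caption, where each text concept is compared against every patch representation, can be sketched as follows. This is an illustrative assumption rather than the paper's code: cosine similarity as the comparison, the per-patch argmax used to turn similarity maps into labels, the toy 2×2 grid of 2-D features, and the names `similarity_map` and `segment` are all hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity; assumes both vectors are non-zero.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def similarity_map(patch_grid, concept):
    # One similarity score per patch for a single text concept.
    return [[cosine(patch, concept) for patch in row] for row in patch_grid]

def segment(patch_grid, concepts):
    # Assumption: each patch takes the label of its most similar concept.
    names = list(concepts)
    maps = {name: similarity_map(patch_grid, concepts[name]) for name in names}
    return [
        [max(names, key=lambda n: maps[n][i][j]) for j in range(len(row))]
        for i, row in enumerate(patch_grid)
    ]

# Toy 2x2 grid of patch features and two text concept embeddings.
patch_grid = [
    [[1.0, 0.1], [0.9, 0.2]],
    [[0.1, 1.0], [0.2, 0.8]],
]
concepts = {"dog": [1.0, 0.0], "grass": [0.0, 1.0]}
labels = segment(patch_grid, concepts)
print(labels)
```

Processing the image at a higher resolution, as in Figure 3, would simply yield a larger `patch_grid`; the per-patch comparison itself is unchanged.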