A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Thomas Stegmüller, Tim Lebailly, Nikola Dukic, Behzad Bozorgtabar, Tinne Tuytelaars, Jean-Philippe Thiran
TL;DR
This paper tackles open-vocabulary zero-shot segmentation by decoupling visual representation learning from cross-modal alignment and leveraging frozen vision backbones that exhibit spatial awareness. It introduces SimZSS, a simple framework that identifies textual concepts via noun phrases in captions, maps them into the visual space, retrieves the corresponding visual concepts through similarity-based pooling, and enforces both global and concept-level consistency. The total loss is $\mathcal{L}_{\mathrm{tot}} = \mathcal{L}_{\mathrm{g}} + \lambda \mathcal{L}_{\mathrm{l}}$, where $\mathcal{L}_{\mathrm{g}}$ is the global loss, $\mathcal{L}_{\mathrm{l}}$ is the concept-level (local) loss, and $\lambda$ weights the local term. Trained on COCO Captions in under 15 minutes on 8 GPUs, the approach achieves state-of-the-art results on 7 of 8 segmentation benchmarks, and it remains robust on both small curated data and large-scale noisy data (LAION-400M) while requiring few hyperparameters. Extensive ablations, covering the necessity of a concept bank, backbone choices, and resolution effects, support the claim that open-vocabulary segmentation can reach high performance without heavy mask supervision or cross-modal projection, making the approach both data- and compute-efficient for practical deployment.
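To make the pooling and loss combination concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that similarity-based pooling can be approximated by a softmax over cosine similarities between each textual concept and the patch embeddings. The function names, the `temperature` parameter, and the default value of `lam` are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def pool_visual_concepts(patches, concepts, temperature=0.1):
    """Retrieve one visual concept per textual concept.

    patches:  (N, D) patch embeddings from a frozen vision backbone.
    concepts: (K, D) textual concept embeddings already mapped into
              the visual space.
    Returns a (K, D) array where each row is a similarity-weighted
    average of the patch embeddings.

    Assumption: softmax weighting with a temperature is one plausible
    reading of "similarity-based pooling", chosen for illustration.
    """
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    c = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
    sim = c @ p.T                                   # (K, N) cosine similarities
    weights = softmax(sim / temperature, axis=1)    # each row sums to 1
    return weights @ patches                        # (K, D)


def total_loss(global_loss, local_loss, lam=1.0):
    """L_tot = L_g + lambda * L_l (lam default is an assumption)."""
    return global_loss + lam * local_loss
```

With a low temperature, each pooled visual concept is dominated by the patches most similar to the textual concept; a higher temperature spreads the weight more evenly over the image.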
Abstract
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pair datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
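As a toy illustration of principle ii), the sketch below locates concepts in a caption and records which tokens they span. It is a deliberate simplification: plain word matching against a hypothetical concept bank stands in for the linguistic analysis (noun-phrase identification) that the method relies on, and the function name and bank contents are assumptions made for this example.

```python
import re


def find_concepts(caption, concept_bank):
    """Return (concept, start, end) token spans for bank entries in a caption.

    Assumption: exact word matching replaces a real noun-phrase parser.
    `start` is inclusive and `end` is exclusive, both in token units.
    """
    tokens = re.findall(r"[a-z]+", caption.lower())
    hits = []
    for concept in concept_bank:
        words = concept.lower().split()
        n = len(words)
        # Slide a window of n tokens over the caption.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                hits.append((concept, i, i + n))
    return hits


# Hypothetical concept bank and caption.
bank = ["dog", "traffic light", "cat"]
spans = find_concepts("A dog waits near a traffic light.", bank)
```

Because text is discrete, each matched span identifies exactly which caption tokens belong to a concept, which is the localization cue that captions otherwise lack.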
