Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis, Vladan Stojnić, Bill Psomas, Nikos Komodakis, Giorgos Tolias

TL;DR

This work introduces a few-shot setting that augments textual prompts with a support set of pixel-annotated images and proposes a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features.

Abstract

Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.

Paper Structure (31 sections, 13 equations, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Overview of RnS when full textual and visual support is available. Given a set of pixel-level annotated images, per-image visual class features $\mathbf{v}^{i}_{c}$ are extracted. These features are then aggregated by class to form visual class features $\mathbf{v}_{c}$, which are combined with textual class features $\mathbf{t}_c$ through a mixing coefficient $\lambda$ to produce fused class features $\mathbf{f}_{c\lambda}$. During test-time training, a test-image-relevant subset of visual support features and fused class features, along with their class labels, is used to train a lightweight linear classifier $g_{\theta}$ with a cross-entropy loss. Each training sample is weighted with a class relevance weight $w_{c}$ (e.g., $w_{\blacksquare}$ for bg). At inference, this classifier, trained per test image, is applied to patch-level features $\mathbf{x}_j^q$ to generate segmentation predictions. When SAM is available, patch-level features are replaced by region-level features $\mathbf{x}^{q}_r$ for improved accuracy.
  • Figure 2: Full textual and visual support. We compare zero-shot, RnS, kNN-CLIP, and FreeDA, together with their variants without class name information (w/o text), for an increasing number of support images per class. SAM 2.1 is used for region proposals. Left: OpenCLIP (ViT-B/16) for region-level predictions. Right: DINOv3.txt (ViT-L/16) for patch-level and region-level predictions.
  • Figure 3: Partial visual (left) and textual (right) support settings. Results of zero-shot, RnS, kNN-CLIP, and FreeDA, together with their variants without class name information (w/o text). RnS is evaluated without the pseudo-label loss in Eq. (\ref{eq:proto-pseudo-loss}). OpenCLIP ViT-B/16 and SAM 2.1 are used. Left: a fraction of classes lack visual examples, while $B=3$ for the rest. Right: a fraction of classes lack textual class names, and $B=1$.
  • Figure 4: Impact of retrieval on RnS. We replace the retrieved visual support feature set $\mathcal{V}_r$ of RnS with a random subset of the visual support feature set $\mathcal{V}$, or different variants of visual support features from the retrieved classes $\mathcal{V}_{\mathcal{C}_r}$.
  • Figure 5: Comparison in a closed-vocabulary setting. We compare RnS to the offline baseline competitors. To ensure a fair comparison, we tune the learning rate, batch size, and number of iterations using a train-validation split of the available support images. No mask proposals are used. We report average performance on VOC, ADE, and Stuff.
  • ...and 14 more figures
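
The pipeline described in the Figure 1 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the convex-combination form of the fusion, the max-cosine-similarity retrieval rule, the uniform class relevance weights, and all function names (`fuse_class_features`, `retrieve_support`, `train_linear_classifier`, `demo`) are assumptions made for illustration, and random vectors stand in for real VLM features.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_class_features(text_feats, visual_feats, lam):
    # ASSUMPTION: fusion is a convex combination of t_c and v_c, re-normalized.
    return l2_normalize(lam * text_feats + (1.0 - lam) * visual_feats)

def retrieve_support(query_feats, support_feats, k):
    # ASSUMPTION: keep the k support features most similar to any query patch.
    sims = l2_normalize(support_feats) @ l2_normalize(query_feats).T
    return np.argsort(-sims.max(axis=1))[:k]

def train_linear_classifier(feats, labels, sample_weights, num_classes,
                            lr=0.5, steps=300):
    # Weighted softmax cross-entropy, minimized by plain gradient descent.
    W = np.zeros((feats.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) * sample_weights[:, None] / sample_weights.sum()
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def demo(seed=0, num_classes=3, dim=32, shots=4, lam=0.5):
    rng = np.random.default_rng(seed)
    centers = l2_normalize(rng.normal(size=(num_classes, dim)))

    def noisy(labels):
        # Synthetic stand-in for encoder features of the given classes.
        noise = 0.02 * rng.normal(size=(len(labels), dim))
        return l2_normalize(centers[labels] + noise)

    classes = np.arange(num_classes)
    text_feats = noisy(classes)                       # t_c
    support_labels = np.repeat(classes, shots)
    support_feats = noisy(support_labels)             # per-image v_c^i
    visual_feats = l2_normalize(np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in classes]))  # v_c
    fused = fuse_class_features(text_feats, visual_feats, lam)

    query_labels = rng.integers(0, num_classes, size=50)
    query_feats = noisy(query_labels)                 # patch features x_j^q

    # Test-time training set: retrieved support features plus fused features.
    keep = retrieve_support(query_feats, support_feats, k=9)
    train_feats = np.concatenate([support_feats[keep], fused])
    train_labels = np.concatenate([support_labels[keep], classes])
    class_weights = np.ones(num_classes)              # placeholder for w_c
    W, b = train_linear_classifier(
        train_feats, train_labels, class_weights[train_labels], num_classes)

    preds = (query_feats @ W + b).argmax(axis=1)      # per-patch prediction
    return preds, query_labels

if __name__ == "__main__":
    preds, truth = demo()
    print("patch accuracy on synthetic data:", (preds == truth).mean())
```

Because the classifier is retrained for every test image, the sketch keeps it to a single linear layer trained on a handful of samples. In this toy setup, swapping the patch features for region-level features would only change what `query_feats` contains.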