Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation

Francisco Eiras, Kemal Oksuz, Adel Bibi, Philip H. S. Torr, Puneet K. Dokania

TL;DR

This work tackles referring image segmentation without dense mask annotations by proposing Segment-Select-Correct (S+S+C), a three-stage framework. Stage 1 generates open-vocabulary instance masks for the referred objects; Stage 2 performs zero-shot selection of the most likely mask using CLIP with visual prompting; Stage 3 bootstraps a RIS model on the resulting pseudo-masks and uses constrained greedy matching to correct zero-shot selection errors. The approach achieves state-of-the-art results in weakly-supervised RIS and narrows the performance gap to fully-supervised methods, with notable gains on datasets containing multiple referenced objects per image. Because the stages are modular, the method can also benefit from future advances in open-vocabulary segmentation and vision-language grounding.

Abstract

Referring Image Segmentation (RIS), the problem of identifying objects in images through natural language sentences, is a challenging task currently mostly solved through supervised learning. However, collecting referred annotation masks is a time-consuming process, and the few existing weakly-supervised and zero-shot approaches fall significantly short of fully-supervised ones in performance. To bridge the performance gap without mask annotations, we propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps: obtaining instance masks for the object mentioned in the referencing instruction (segment), using zero-shot learning to select a potentially correct mask for the given instruction (select), and bootstrapping a model which allows for fixing the mistakes of zero-shot selection (correct). In our experiments, using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 16.5%, while our full method improves upon this much stronger baseline and sets the new state-of-the-art for weakly-supervised RIS, reducing the gap between the weakly-supervised and fully-supervised methods in some cases from around 33% to as little as 7%. Code is available at https://github.com/fgirbal/segment-select-correct.
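
To make the select step concrete, the following is a minimal sketch of zero-shot mask selection by visual prompting, under stated assumptions. It blurs everything outside each candidate mask ("reverse blurring") and keeps the candidate whose prompted image scores highest. The function names, the box blur, and the `score_fn` callback are all hypothetical: `score_fn` stands in for a CLIP image-text similarity against the referring expression, and is not the authors' implementation.

```python
import numpy as np


def box_blur(image: np.ndarray, radius: int = 2) -> np.ndarray:
    """Blur an H x W x C float image with a (2r+1) x (2r+1) mean filter."""
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    size = 2 * radius + 1
    h, w = image.shape[:2]
    # Sum all shifted copies of the image, then normalise by the window area.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)


def reverse_blur(image: np.ndarray, mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Keep pixels inside the mask sharp and blur everything else."""
    blurred = box_blur(image, radius)
    keep = mask.astype(bool)[..., None]  # broadcast the mask over channels
    return np.where(keep, image, blurred)


def select_mask(image, masks, score_fn, radius: int = 2):
    """Return (index of the best candidate, list of all scores).

    score_fn is an assumed callback mapping a prompted image to a scalar;
    in practice it would be an image-text similarity for the expression.
    """
    scores = [score_fn(reverse_blur(image, m, radius)) for m in masks]
    return int(np.argmax(scores)), scores
```

A call such as `select_mask(image, candidate_masks, score_fn)` returns the chosen candidate index together with the per-candidate scores, which makes it easy to inspect how close the runner-up was.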

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 4 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Segment, Select, Correct for Referring Image Segmentation: our three-stage approach uses an open-vocabulary segmentation step from referring expressions to obtain all the candidate masks for the object in those sentences (segment, Stage 1), followed by a zero-shot instance choice module to select the most likely correct mask (select, Stage 2), and then trains a corrected RIS model using constrained greedy matching to fix the zero-shot mistakes (correct, Stage 3).
  • Figure 2: Open-Vocabulary Segmentation from Referring Expressions: given a referring expression, we first extract the key noun phrase, project it to a set of context-specific classes, and then use open-vocabulary instance segmentation to obtain all the candidate masks for the object.
  • Figure 3: Zero-Shot Choice for Referring Image Segmentation: following the main idea from yang2023fine, we choose a zero-shot mask from the candidates by applying visual prompting, which highlights each object via reverse blurring, and then using CLIP similarity to determine the most likely mask.
  • Figure 4: Grounding + Constrained Greedy Matching: using set $\mathcal{S}_2$ masks, we start by pre-training a zero-shot bootstrapped model (ZSBootstrap) that grounds referring concepts; this model is then used to initialize a corrected model trained on set $\mathcal{S}_1$ masks with constrained greedy matching.
  • Figure 5: Object Instances per Image: distribution of the number of object instances, $\mathbf{O}_{i,j}$, referenced in each image, $I_i$, within the training sets of the studied datasets.
  • ...and 1 more figures
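
The constrained greedy matching named in Figures 1 and 4 can be illustrated with a small sketch. The captions do not spell out the constraint or the matching cost, so this example assumes that distinct referred objects in one image must be matched to distinct candidate masks, and it uses 1 - IoU as a stand-in cost. The function names are hypothetical and this is not the authors' implementation.

```python
import numpy as np


def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def constrained_greedy_match(pred_masks, candidate_masks):
    """Greedily pair each predicted mask with a distinct candidate mask.

    Assumed constraint: a candidate may be used by at most one prediction.
    Returns a dict mapping prediction index -> candidate index.
    """
    # Assumed cost: 1 - IoU between each prediction and each candidate.
    cost = np.array([[1.0 - iou(p, c) for c in candidate_masks] for p in pred_masks])
    assignment = {}
    free_preds = set(range(len(pred_masks)))
    free_cands = set(range(len(candidate_masks)))
    while free_preds and free_cands:
        # Take the cheapest remaining (prediction, candidate) pair.
        i, j = min(
            ((i, j) for i in free_preds for j in free_cands),
            key=lambda ij: cost[ij],
        )
        assignment[i] = j
        free_preds.remove(i)
        free_cands.remove(j)
    return assignment
```

Under this reading, a prediction whose best candidate is already taken by a better-matching prediction falls back to its next-best free candidate, which is one way a model could move away from a wrong zero-shot choice during training.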