Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation
Francisco Eiras, Kemal Oksuz, Adel Bibi, Philip H. S. Torr, Puneet K. Dokania
TL;DR
This work tackles referring image segmentation (RIS) without dense mask annotations by proposing Segment-Select-Correct (S+S+C), a three-stage framework. Stage 1 generates open-vocabulary instance masks for the referred objects; Stage 2 performs zero-shot selection of the correct mask using CLIP with visual prompting; Stage 3 bootstraps a RIS model trained on pseudo-masks and uses constrained greedy matching to correct zero-shot selection errors. The approach achieves state-of-the-art results in weakly-supervised RIS and narrows the performance gap to fully-supervised methods, with notable gains on datasets containing multiple referenced objects per image. The method offers a practical path to high-performing RIS without costly mask annotations and can adapt to future advances in open-vocabulary segmentation and vision-language grounding.
Abstract
Referring Image Segmentation (RIS) - the problem of identifying objects in images through natural language sentences - is a challenging task that is currently solved mostly through supervised learning. However, collecting referred annotation masks is a time-consuming process, and the few existing weakly-supervised and zero-shot approaches fall significantly short of fully-supervised ones in performance. To bridge the performance gap without mask annotations, we propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps: obtaining instance masks for the object mentioned in the referencing instruction (segment), using zero-shot learning to select a potentially correct mask for the given instruction (select), and bootstrapping a model which allows for fixing the mistakes of zero-shot selection (correct). In our experiments, using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 16.5%, while our full method improves upon this much stronger baseline and sets the new state-of-the-art for weakly-supervised RIS, reducing the gap between the weakly-supervised and fully-supervised methods in some cases from around 33% to as little as 7%. Code is available at https://github.com/fgirbal/segment-select-correct.
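
To make the select and correct steps more concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a precomputed score matrix in which rows are referring expressions for one image and columns are candidate instance masks; in practice such scores might come from CLIP similarities or from the overlap between a bootstrapped model's prediction and each candidate mask, but here they are hand-written toy values. The function names `select_zero_shot` and `greedy_constrained_match` are hypothetical, and the constraint that no two expressions share a mask is an assumed reading of "constrained greedy matching".

```python
from typing import Dict, List


def select_zero_shot(scores: List[List[float]]) -> List[int]:
    """Select step (sketch): each expression independently picks its best mask."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def greedy_constrained_match(scores: List[List[float]]) -> Dict[int, int]:
    """Correct step (sketch): greedy one-to-one matching of expressions to masks.

    Assumed constraint (not taken from the source): a mask may be assigned
    to at most one expression.
    """
    # Enumerate every (score, expression, mask) triple, highest score first.
    pairs = sorted(
        ((s, e, m) for e, row in enumerate(scores) for m, s in enumerate(row)),
        key=lambda t: -t[0],
    )
    assignment: Dict[int, int] = {}
    used_masks = set()
    for _, e, m in pairs:
        # Skip pairs whose expression or mask has already been matched.
        if e in assignment or m in used_masks:
            continue
        assignment[e] = m
        used_masks.add(m)
    return assignment


# Toy scores (made up): 2 expressions, 3 candidate masks.
toy_scores = [
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.2],
]
independent = select_zero_shot(toy_scores)      # both expressions pick mask 0
matched = greedy_constrained_match(toy_scores)  # the conflict is resolved
```

In this toy case independent selection assigns mask 0 to both expressions, while the constrained matching gives expression 0 mask 0 and expression 1 mask 1, which illustrates why a one-to-one constraint can help when several expressions refer to different objects in the same image.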
