Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation

Seonghoon Yu, Paul Hongsuck Seo, Jeany Son

TL;DR

This work tackles the high labeling cost of referring image segmentation (RIS) by proposing Pseudo-RIS, a framework that automatically generates high-quality segmentation masks paired with distinctive referring expressions as pseudo-supervisions, so that supervised RIS models can be trained without manual masks. The approach combines a segmentation foundation model (SAM) with a captioning foundation model (CoCa) and introduces two key strategies: distinctive caption sampling, which produces target-focused expressions, and CLIP-based distinctiveness filtering, which ensures referential clarity. Empirically, Pseudo-RIS achieves state-of-the-art performance on RIS benchmarks, demonstrates strong cross-domain and open-world generalization, and remains effective in semi-supervised settings when limited human labels are available. The method reduces labeling costs, leverages the broad domain coverage of foundation models, and offers practical benefits for scalable RIS in diverse, real-world environments. The paper also acknowledges the biases and limitations associated with large pre-trained models.

Abstract

We propose a new framework that automatically generates high-quality segmentation masks with their referring expressions as pseudo supervisions for referring image segmentation (RIS). These pseudo supervisions allow the training of any supervised RIS method without the cost of manual labeling. To achieve this, we incorporate existing segmentation and image captioning foundation models, leveraging their broad generalization capabilities. However, a naive combination of these models may generate non-distinctive expressions that do not uniquely refer to the target masks. To address this challenge, we propose a two-fold strategy for generating distinctive captions: 1) 'distinctive caption sampling', a new decoding method for the captioning model that generates multiple expression candidates with detailed words focusing on the target; and 2) 'distinctiveness-based text filtering', which further validates the candidates and filters out those with a low level of distinctiveness. These two strategies ensure that the generated text supervisions can distinguish the target from other objects, making them appropriate as RIS annotations. Our method significantly outperforms both weakly supervised and zero-shot SoTA methods on the RIS benchmark datasets. It also surpasses fully supervised methods in unseen domains, demonstrating its capability to tackle the open-world challenge within RIS. Furthermore, integrating our method with human annotations yields further improvements, highlighting its potential in semi-supervised learning applications.

Paper Structure

This paper contains 49 sections, 7 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: Illustration of our distinctive pseudo-supervision generation: (a) a distinctive caption "a brown cow with a long tail" distinctively refers to a target mask, while a non-distinctive caption "a cow with a tail" misleadingly points to a non-targeted object, since there is another cow with a tail in the image; (b) two proposed methods for distinctive caption generation given segmentation masks: 1) generating multiple caption candidates with target-specific words; 2) filtering out the misleading captions among all candidates.
  • Figure 2: Distinctive caption candidates generation of our Pseudo-RIS. Given segmentation masks, we generate multiple distinctive caption candidates for each mask using a frozen image captioning model with the proposed distinctive caption sampling. This method calibrates the word distribution of the target using those of the other masks, and then samples the next word from the calibrated distribution of the target.
  • Figure 3: Distinctiveness-based text filtering of our Pseudo-RIS. For each mask, we filter out non-distinctive caption candidates with a distinctiveness score below a threshold $\tau$. This distinctiveness score measures how well a caption distinctively refers to a target mask, with examples illustrating: (a) an incorrect caption describing a non-targeted horse, resulting in low correctness despite its high uniqueness, (b) an ambiguous caption referring to an unintended mask, yielding low uniqueness but high correctness, and (c) a target-specific distinctive caption, achieving high uniqueness and correctness scores, leading to high distinctiveness.
  • Figure 4: mIoU of two models across different volumes of human-labeled data: our Pseudo-RIS in a semi-supervised setting and a fully supervised model. Our method combines our pseudo-supervisions with varying proportions of human-labeled data, while the fully supervised method relies only on human-labeled data. Both models are based on ETRIS [etris]. We also provide performance comparisons with other semi-supervised RIS methods [semi_ris_1, semi_ris_2, semi_safari] in our supplementary.
  • Figure 5: Qualitative analysis of our generated expressions compared to the naïve methods and GT captions.
  • ...and 5 more figures
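
The two steps described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the calibration rule (target log-probability minus a weighted mean of the other masks' log-probabilities), the weight `alpha`, the `temperature`, the product form of the distinctiveness score, and all function names are hypothetical. The similarity values stand in for CLIP-style caption-to-mask scores and are made up for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(target_probs, other_probs_list, alpha=1.0):
    """Down-weight words that are also likely for the other masks.

    Assumed rule (not taken from the paper):
    score(w) = log p_target(w) - alpha * mean_j log p_other_j(w),
    renormalised with a softmax.
    """
    eps = 1e-12
    scores = []
    for w, p_t in enumerate(target_probs):
        penalty = 0.0
        if other_probs_list:
            logs = [math.log(p[w] + eps) for p in other_probs_list]
            penalty = sum(logs) / len(logs)
        scores.append(math.log(p_t + eps) - alpha * penalty)
    return softmax(scores)


def sample_word(probs, rng):
    """Draw one word index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def distinctiveness(sims, target_idx, temperature=0.1):
    """Toy distinctiveness score from caption-to-mask similarities.

    sims[i] is a hypothetical similarity between the caption and mask i.
    correctness: similarity to the target mask.
    uniqueness: how sharply the caption singles out one mask.
    Their product is an assumed way of combining the two.
    """
    correctness = sims[target_idx]
    uniqueness = max(softmax([s / temperature for s in sims]))
    return correctness * uniqueness


def filter_captions(candidates, target_idx, tau):
    """Keep captions whose distinctiveness score is at least tau."""
    return [text for text, sims in candidates
            if distinctiveness(sims, target_idx) >= tau]


# Toy vocabulary and next-word distributions for the target and one other mask.
vocab = ["a", "cow", "brown", "tail"]
target = [0.1, 0.5, 0.2, 0.2]
others = [[0.1, 0.6, 0.05, 0.25]]
calibrated = calibrate(target, others)
next_word = vocab[sample_word(calibrated, random.Random(0))]

# Toy candidates: (caption, similarities to [target mask, other mask]).
candidates = [
    ("a cow with a tail", [0.30, 0.29]),             # ambiguous
    ("a black cow lying down", [0.10, 0.35]),        # describes the other mask
    ("a brown cow with a long tail", [0.32, 0.20]),  # distinctive
]
kept = filter_captions(candidates, target_idx=0, tau=0.2)
print(next_word, kept)
```

In this toy setup, calibration shifts probability mass toward "brown", the word that is likely for the target but unlikely for the other mask, and the filter keeps only the third caption. The threshold `tau=0.2` is arbitrary and would need tuning against real similarity scores.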