
Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models

Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo

TL;DR

This work addresses zero-shot referring image segmentation (RIS) with a pipeline built on a generative model and an optional discriminative module. The Generative Process derives a correlation map from cross-attention in Stable Diffusion and uses it to form proposals, while the Discriminative Process scores candidates with CLIP features; the two sets of candidates are combined as $s_i = \alpha s_i^{\mathrm{G}} + (1-\alpha) s_i^{\mathrm{D}}$. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that a purely generative approach can rival weakly supervised state-of-the-art methods, and that combining generative and discriminative signals yields substantial further improvements. The results indicate that generative models provide intrinsic proposal generation and localization cues, which enables training-free RIS and opens a new direction for using multi-modal generative models in visual grounding tasks.
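
The weighted combination above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `fuse_scores` and `select_proposal`, the example scores, and the choice of taking the highest-scoring candidate as the final result are assumptions made here for illustration. The only part taken from the text is the formula $s_i = \alpha s_i^{\mathrm{G}} + (1-\alpha) s_i^{\mathrm{D}}$ and the fact that the discriminative term is optional.

```python
import numpy as np


def fuse_scores(s_gen, s_disc=None, alpha=0.5):
    """Combine per-candidate scores as alpha * s_gen + (1 - alpha) * s_disc.

    If s_disc is None, the generative scores are returned unchanged,
    mirroring the generative-only setting described in the text.
    """
    s_gen = np.asarray(s_gen, dtype=float)
    if s_disc is None:
        return s_gen
    s_disc = np.asarray(s_disc, dtype=float)
    if s_gen.shape != s_disc.shape:
        raise ValueError("generative and discriminative scores must align")
    return alpha * s_gen + (1.0 - alpha) * s_disc


def select_proposal(s_gen, s_disc=None, alpha=0.5):
    """Return the index of the best candidate (assumed here to be the argmax)."""
    return int(np.argmax(fuse_scores(s_gen, s_disc, alpha)))


# Hypothetical scores for three candidate masks.
s_gen = [0.2, 0.7, 0.4]
s_disc = [0.9, 0.3, 0.5]

generative_only = select_proposal(s_gen)
combined = select_proposal(s_gen, s_disc, alpha=0.5)
```

With these made-up numbers the two settings pick different candidates, which is the situation in which the weight $\alpha$ matters: it controls how far the discriminative scores can override the generative ranking.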

Abstract

Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask based on the given referring descriptions, without training on this type of paired data. Current zero-shot methods mainly focus on using pre-trained discriminative models (e.g., CLIP). However, we have observed that generative models (e.g., Stable Diffusion) have potentially understood the relationships between various visual elements and text descriptions, which are rarely investigated in this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task, which leverages the fine-grained multi-modal information from generative models. We demonstrate that without a proposal generator, a generative model alone can achieve comparable performance to existing SOTA weakly-supervised models. When we combine both generative and discriminative models, our Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.

Paper Structure

This paper contains 21 sections, 10 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of our Ref-Diff. Our proposed Generative Process (left) generates a correlation matrix between the referring text and the input image. This matrix serves as an alternative, weight-free proposal generator and yields the generative segmentation candidates $\mathbf{s}^{\mathrm{G}}$. The Discriminative Process (right) is optionally integrated into our framework and generates the discriminative candidates $\mathbf{s}^{\mathrm{D}}$. The final referring segmentation result is obtained either from the generative candidates alone or from a combination of the generative and discriminative candidates.
  • Figure 2: Effectiveness of generative model in segmentation capability. Ref-Diff/g is capable of segmenting the right content even without the assistance of the pre-trained segmentor and CLIP. Combined with the pre-trained segmentor, Ref-Diff/gs achieves precise segmentation of the correct regions.
  • Figure 3: Effectiveness of generative model in localization capability. The discriminative model focuses more on whether the image contains text-related content, which may result in mistakenly selecting larger regions.
  • Figure 4: Effectiveness of discriminative model. The generative model exhibits higher sensitivity to salient visual features, which can result in partial segmentation when solely relying on the generative model. By integrating the discriminative model, we can effectively mitigate such errors and achieve more accurate results.
  • Figure 5: Attention from generative model. The generative model directs attention to different regions of the image depending on the token, which is the key reason for the effectiveness of Ref-Diff. The dashed box highlights the root token and its corresponding attention map.
  • ...and 1 more figure