Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models
Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo
TL;DR
This work addresses zero-shot referring image segmentation (RIS) with a pipeline built on a generative model, optionally combined with a discriminative module. The Generative Process derives a correlation map from the cross-attention of Stable Diffusion and uses it to form proposals, while the Discriminative Process provides an alternative score using CLIP features. The final score of each candidate i is the weighted combination s_i = α s^G_i + (1−α) s^D_i, where s^G_i and s^D_i are the generative and discriminative scores and α balances the two. Experiments on RefCOCO, RefCOCO+, and RefCOCOg show that a purely generative approach can rival weakly supervised state-of-the-art methods, and that combining generative and discriminative signals yields substantial improvements. The results indicate that generative models provide intrinsic proposal generation and localization cues, which enables training-free RIS and opens a new direction for using multi-modal generative models in visual grounding tasks.
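As a rough illustration of the fusion rule s_i = α s^G_i + (1−α) s^D_i, the following minimal sketch combines two lists of per-candidate scores and picks the best candidate. The function names, the default α = 0.5, the example scores, and the choice of the highest fused score as the final selection are assumptions made for illustration; they are not taken from the released Ref-Diff code.

```python
def fuse_scores(gen_scores, disc_scores, alpha=0.5):
    """Return s_i = alpha * s^G_i + (1 - alpha) * s^D_i for every candidate i.

    gen_scores  : generative scores s^G_i (hypothetical input, one per candidate)
    disc_scores : discriminative scores s^D_i (hypothetical input, same length)
    alpha       : weight in [0, 1]; 0.5 is an assumed default, not a reported value
    """
    if len(gen_scores) != len(disc_scores):
        raise ValueError("both score lists must cover the same candidates")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * g + (1.0 - alpha) * d for g, d in zip(gen_scores, disc_scores)]


def select_candidate(gen_scores, disc_scores, alpha=0.5):
    """Return (index of the highest fused score, list of fused scores).

    Taking the arg-max is an assumption about how the fused score is used.
    """
    fused = fuse_scores(gen_scores, disc_scores, alpha)
    if not fused:
        raise ValueError("at least one candidate is required")
    best = max(range(len(fused)), key=lambda i: fused[i])
    return best, fused


if __name__ == "__main__":
    # Made-up scores for three candidates, for illustration only.
    s_gen = [0.2, 0.9, 0.4]
    s_disc = [0.8, 0.3, 0.5]
    for a in (0.0, 0.5, 1.0):
        index, fused = select_candidate(s_gen, s_disc, alpha=a)
        print(f"alpha={a}: selected candidate {index}, fused scores {fused}")
```

Setting α = 1 reduces the rule to the purely generative score and α = 0 to the purely discriminative one, so the same function covers both single-model settings as well as the combined one.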
Abstract
Zero-shot referring image segmentation is a challenging task because it aims to find an instance segmentation mask from a given referring description, without training on this type of paired data. Current zero-shot methods mainly rely on pre-trained discriminative models (e.g., CLIP). However, we observe that generative models (e.g., Stable Diffusion) may already capture the relationships between visual elements and text descriptions, a capability that has rarely been investigated for this task. In this work, we introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task, which leverages the fine-grained multi-modal information from generative models. We demonstrate that, without a proposal generator, a generative model alone can achieve performance comparable to existing SOTA weakly-supervised models. When we combine generative and discriminative models, Ref-Diff outperforms these competing methods by a significant margin. This indicates that generative models are also beneficial for this task and can complement discriminative models for better referring segmentation. Our code is publicly available at https://github.com/kodenii/Ref-Diff.
