LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation

Jiachen Li, Qing Xie, Renshu Gu, Jinyu Xu, Yongjian Liu, Xiaohan Yu

TL;DR

This work tackles zero-shot Referring Image Segmentation by addressing the mismatch between visual regions and free-form referring expressions. It introduces LGD, which uses two prompts to guide Multi-Modal Large Language Models (MLLMs) in generating fine-grained attribute descriptions and surrounding-object descriptions, paired with three CLIP-based matching scores that align instance-level visual features with textual cues. The method computes $S_{att}$, $S_{sur}$, and $S_{van}$ and fuses them as $S = S_{van} + \alpha S_{att} + \beta S_{sur}$ to select the target mask. LGD achieves new state-of-the-art results on RefCOCO, RefCOCO+, and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU over previous methods, and demonstrates the effectiveness of prompt-driven language guidance for cross-modal segmentation in complex scenes. These findings suggest that integrating MLLMs with Vision-Language Models can substantially improve fine-grained cross-modal grounding without additional task-specific training.
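The fusion rule $S = S_{van} + \alpha S_{att} + \beta S_{sur}$ can be illustrated with a minimal sketch. It assumes the three scores have already been computed for each candidate mask; the function names, the default values of `alpha` and `beta`, and the example numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def fuse_scores(s_van, s_att, s_sur, alpha=0.5, beta=0.5):
    """Linearly combine per-mask scores: S = S_van + alpha * S_att + beta * S_sur.

    alpha and beta defaults are placeholders, not the paper's settings.
    """
    s_van = np.asarray(s_van, dtype=float)
    s_att = np.asarray(s_att, dtype=float)
    s_sur = np.asarray(s_sur, dtype=float)
    return s_van + alpha * s_att + beta * s_sur

def select_mask(s_van, s_att, s_sur, alpha=0.5, beta=0.5):
    """Return the index of the candidate mask with the highest fused score."""
    fused = fuse_scores(s_van, s_att, s_sur, alpha, beta)
    return int(np.argmax(fused))

# Made-up scores for three candidate masks.
s_van = [0.2, 0.3, 0.1]
s_att = [0.1, 0.0, 0.5]
s_sur = [0.0, 0.1, 0.4]
best = select_mask(s_van, s_att, s_sur)
```

With these made-up numbers, $S_{van}$ alone would favor the second mask, while the fused score favors the third, which shows how the two description-based terms can change the selection.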

Abstract

Zero-shot referring image segmentation aims to locate and segment the target region based on a referring expression, with the primary challenge of aligning and matching semantics across visual and textual modalities without training. Previous works address this challenge by utilizing Vision-Language Models and mask proposal networks for region-text matching. However, this paradigm may lead to incorrect target localization due to the inherent ambiguity and diversity of free-form referring expressions. To alleviate this issue, we present LGD (Leveraging Generative Descriptions), a framework that utilizes the advanced language generation capabilities of Multi-Modal Large Language Models to enhance region-text matching performance in Vision-Language Models. Specifically, we first design two kinds of prompts, the attribute prompt and the surrounding prompt, to guide the Multi-Modal Large Language Models in generating descriptions related to the crucial attributes of the referent object and the details of surrounding objects, referred to as attribute description and surrounding description, respectively. Secondly, three visual-text matching scores are introduced to evaluate the similarity between instance-level visual features and textual features, which determines the mask most associated with the referring expression. The proposed method achieves new state-of-the-art performance on three public datasets RefCOCO, RefCOCO+ and RefCOCOg, with maximum improvements of 9.97% in oIoU and 11.29% in mIoU compared to previous methods.
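As a rough illustration of a visual-text matching score, the sketch below compares instance-level visual features with one textual feature using cosine similarity. The abstract does not give the exact score definitions, so the use of cosine similarity, the function name `cosine_scores`, and the toy feature vectors are assumptions; in practice the features would come from CLIP's image and text encoders.

```python
import numpy as np

def cosine_scores(visual_feats, text_feat, eps=1e-12):
    """Cosine similarity between each of N mask features (N, D) and one text feature (D,).

    Assumption: the matching score is a plain cosine similarity.
    """
    visual_feats = np.asarray(visual_feats, dtype=float)
    text_feat = np.asarray(text_feat, dtype=float)
    # Normalize each vector, guarding against division by zero.
    v_norm = np.maximum(np.linalg.norm(visual_feats, axis=1, keepdims=True), eps)
    t_norm = max(np.linalg.norm(text_feat), eps)
    return (visual_feats / v_norm) @ (text_feat / t_norm)

# Toy 2-D features for three candidate masks and one description.
masks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [1.0, 0.0]
scores = cosine_scores(masks, text)
```

Calling the same function once per text input (the referring expression, the attribute description, and the surrounding description) would yield three score vectors, one entry per candidate mask.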

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: An illustration of the inherent ambiguity and diversity of free-form referring expressions. (a) The referring expression describes the clothing of the referent object without explicitly mentioning the man wearing a white shirt. (b) The referring expression highlights the action of the referent object rather than explicitly describing the woman making the call. (c) The referring expression conveys the registration number of the referent object rather than explicitly mentioning the aircraft registered as n177xy. Referring expressions are unrestricted in how they describe objects and often lack detailed descriptions of the referent object, making it challenging for the model to locate it accurately. We introduce the attribute description and the surrounding description, which help identify the referent object and its crucial attributes while distinguishing it from surrounding objects, providing the model with fine-grained information.
  • Figure 2: The pipeline of LGD. We construct the attribute prompt and the surrounding prompt by combining instructions with the referring expression. Given the input image and each prompt, the MLLM generates the attribute description and the surrounding description. After extracting features from the image and text using CLIP, we calculate three visual-text matching scores and obtain the most relevant mask by linearly combining these scores.
  • Figure 3: Sensitivity to $\alpha$ and $\beta$ on RefCOCO.
  • Figure 4: Sensitivity to $\alpha$ and $\beta$ on RefCOCO+.
  • Figure 5: Sensitivity to $\alpha$ and $\beta$ on RefCOCOg.
  • ...and 2 more figures