Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation

Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma, Yongjian Liu, Qing Xie, Bolong Zheng

Abstract

Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.

Paper Structure

This paper contains 18 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An example of referring image segmentation, where a free-form referring expression contains detailed attributes and complex relationships among multiple entities (e.g., bottle of water, guy, and yellow short), requiring relational reasoning to accurately localize the target region.
  • Figure 2: Illustration of different referring image segmentation paradigms. (a) Cross-modal alignment methods perform matching by directly aligning image and text representations. (b) MLLMs-based methods generate Semantic Segmentation Prompts to guide segmentation, but lack explicit spatial grounding. (c) Our PPCR progressively generates Semantic and Spatial Segmentation Prompts, explicitly bridging semantic understanding and spatial grounding for accurate segmentation.
  • Figure 3: Overview of PPCR for referring image segmentation. The framework follows a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Given an image and a referring expression, a multimodal large language model first generates a Semantic Segmentation Prompt (Semantic SP) via the LoRA A branch, which then guides the generation of a Spatial Segmentation Prompt (Spatial SP) via the LoRA B branch. The Spatial Segmentation Prompt is mapped to bounding box coordinates, and is jointly used with the Semantic Segmentation Prompt by SAM for instance segmentation. Purple and orange arrows denote semantic and spatial information flows.
  • Figure 4: Illustration of the Semantic Prompt Template used in the Semantic Understanding stage.
  • Figure 5: Illustration of the Spatial Prompt Template used in the Spatial Grounding stage.
  • ...and 2 more figures