RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

Yi Lu, Jiawang Cao, Yongliang Wu, Bozheng Li, Licheng Tang, Yangguang Ji, Chong Wu, Jay Wu, Wenbo Zhu

TL;DR

RSVP tackles the gap between cognitive reasoning and fine-grained visual segmentation by introducing a two-stage framework that first generates interpretable region proposals via Multi-modal Chain-of-Thought Visual Prompting and then refines them with a Vision-Language Segmentation Module. By leveraging zero-shot reasoning capabilities of MLLMs and region-aware prompts, RSVP grounds reasoning in image regions without additional fine-tuning, achieving state-of-the-art performance on ReasonSeg and SegInW. The approach emphasizes modularity, interpretability, and efficiency, demonstrated through comprehensive ablations that highlight the importance of multimodal reasoning, prompt design, and the segmentation module. The results suggest RSVP offers a scalable path toward interpretable, grounding-enabled multimodal reasoning systems with strong open-world generalization.

Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability but lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a structured two-stage framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), which integrates textual and visual cues to produce precise segmentation masks. By explicitly modelling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models not only to reason about objects but also to generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior state-of-the-art methods by up to 6.5 gIoU and 9.2 cIoU on ReasonSeg and achieving 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
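
The abstract reports gains in gIoU and cIoU without defining them. As a minimal sketch, assuming the convention commonly used for ReasonSeg (gIoU as the mean of per-image IoUs, cIoU as cumulative intersection over cumulative union), the two metrics can be computed as below. The function names, the nested-list mask format, and the choice to score an empty union as 1.0 are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple

# Assumption: a mask is a 2D nested sequence of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def giou_and_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) over a dataset of predicted and ground-truth masks.

    gIoU: mean of the per-image IoU values.
    cIoU: total intersection divided by total union, accumulated over all images.
    """
    if len(preds) == 0 or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")

    per_image_iou = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        per_image_iou.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union

    giou = sum(per_image_iou) / len(per_image_iou)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou
```

Because cIoU pools pixels across images, large objects weigh more heavily in it than in gIoU, which is one reason the two numbers are usually reported side by side.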

Paper Structure

This paper contains 39 sections, 6 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: (a) and (b) depict different aspects of our segmentation pipeline performance. More demo results are available in the appendix case study.
  • Figure 2: Overview of the proposed model. An input image is divided into horizontal and vertical regions to assist localization. In the reasoning stage, an MLLM receives a query about the object's protective features, identifies "the shell" as the protective object, and generates a region proposal using region IDs ($id_v$ and $id_h$). Red boxes indicate the regions of interest determined by the MLLM; the yellow box denotes the padding $p$ added to capture the complete visual content. The CoT process enhances reasoning accuracy. In the segmentation stage, a multi-modal encoder integrates textual and visual information, resizing the image for detailed feature extraction. Finally, SAM refines the segmentation by highlighting the shell that acts as a protective covering for the snail.
  • Figure 3: Illustration of the CoT processing strategy in action for a query about a dragon boat race.
  • Figure 4: Illustration of Incorrect Localization cases produced by LISA.
  • Figure 5: Illustration of Low-quality Segmentation mask cases produced by LISA.
  • ...and 8 more figures
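
The Figure 2 caption describes turning region IDs ($id_v$, $id_h$) plus a padding $p$ into a region proposal. The following is a rough sketch of one way that conversion could work, not the authors' implementation. It assumes a uniform grid, zero-based IDs, `ids_v` indexing vertical strips (columns, left to right), `ids_h` indexing horizontal strips (rows, top to bottom), and padding given in pixels; the function name and all parameters are hypothetical.

```python
from typing import Iterable, Tuple


def region_ids_to_box(
    width: int,
    height: int,
    n_v: int,
    n_h: int,
    ids_v: Iterable[int],
    ids_h: Iterable[int],
    pad: int = 0,
) -> Tuple[int, int, int, int]:
    """Map selected grid region IDs to a padded pixel box (x0, y0, x1, y1).

    Assumptions (not from the paper): the image is split into n_v equal-width
    vertical strips and n_h equal-height horizontal strips, IDs are zero-based,
    and the proposal is the tight box spanning all selected strips.
    """
    cols = sorted(set(ids_v))
    rows = sorted(set(ids_h))
    if not cols or not rows:
        raise ValueError("at least one region id is required on each axis")
    if cols[0] < 0 or cols[-1] >= n_v or rows[0] < 0 or rows[-1] >= n_h:
        raise ValueError("region id out of range")

    cell_w = width / n_v
    cell_h = height / n_h

    # Tight box around the selected strips (the red boxes in Figure 2),
    # expanded by the padding (the yellow box in Figure 2).
    x0 = round(cols[0] * cell_w) - pad
    y0 = round(rows[0] * cell_h) - pad
    x1 = round((cols[-1] + 1) * cell_w) + pad
    y1 = round((rows[-1] + 1) * cell_h) + pad

    # Clip to the image bounds so the padded box stays valid.
    return max(0, x0), max(0, y0), min(width, x1), min(height, y1)
```

For an 800x600 image with a 4x3 grid, calling `region_ids_to_box(800, 600, 4, 3, [1, 2], [1], pad=20)` returns `(180, 180, 620, 420)`. A box like this could then be cropped and passed to the segmentation stage.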