Discriminative Perception via Anchored Description for Reasoning Segmentation

Tao Yang, Qing Zhou, Yanliang Li, Qi Wang

TL;DR

DPAD compels the model to generate a descriptive caption of the referred object and then uses that caption for explicit discrimination: it contrasts the caption's semantic relevance to the referred object against its relevance to the wider context, which leads to a more converged and efficient reasoning chain.

Abstract

Reasoning segmentation increasingly employs reinforcement learning (RL) to generate explanatory reasoning chains that guide Multimodal Large Language Models (MLLMs). However, the geometric rewards used in this training are primarily confined to guiding the final localization and cannot discriminate whether the reasoning process remains anchored on the referred region or strays into irrelevant context. Lacking this discriminative guidance, the model's reasoning often devolves into unfocused and verbose chains that ultimately fail to disambiguate and perceive the target in complex scenes. This suggests a need to complement the RL objective with Discriminative Perception, the ability to actively distinguish a target from its context. To realize this, we propose DPAD, which compels the model to generate a descriptive caption of the referred object; the caption is then used to discriminate explicitly by contrasting its semantic relevance to the referred object against its relevance to the wider context. By optimizing for this discriminative capability, the model is forced to focus on the unique attributes of the target, leading to a more converged and efficient reasoning chain. The descriptive caption also serves as an interpretable rationale that aligns with the segmentation. Experiments on the benchmarks confirm the validity of our approach, delivering substantial performance gains, with the cIoU on ReasonSeg increasing by 3.09% and the reasoning chain length decreasing by approximately 42%. Code is available at https://github.com/mrazhou/DPAD
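The contrast described in the abstract can be illustrated with a minimal sketch. It assumes the reward is 1 when the caption scores higher against the referred region than against the whole image, and 0 otherwise; the Figure 2 caption describes the signal as binary, but the exact comparison rule, the `margin` parameter, and the scoring model are not given in this section. All names here (`discriminative_reward`, `crop`, `toy_score`) are hypothetical, and `toy_score` is a deliberately trivial stand-in for whatever semantic-relevance scorer the authors use.

```python
from typing import Callable, List, Tuple

# Toy stand-ins (assumptions for illustration only).
Image = List[List[float]]          # 2D grid of pixel intensities
Box = Tuple[int, int, int, int]    # (x0, y0, x1, y1), upper bounds exclusive
Scorer = Callable[[str, Image], float]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-grid of `image` covered by `box`."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def discriminative_reward(
    caption: str,
    image: Image,
    box: Box,
    score: Scorer,
    margin: float = 0.0,  # assumed knob, not taken from the paper
) -> float:
    """Binary reward: 1.0 if the caption fits the ROI better than the full image."""
    s_roi = score(caption, crop(image, box))  # relevance to the referred region
    s_aoi = score(caption, image)             # relevance to the whole image
    return 1.0 if s_roi - s_aoi > margin else 0.0


def toy_score(caption: str, region: Image) -> float:
    """Hypothetical scorer: mean intensity of the region, ignoring the caption."""
    pixels = [p for row in region for p in row]
    return sum(pixels) / len(pixels) if pixels else 0.0


# A 4x4 image with a bright 2x2 patch in the top-left corner.
image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]

on_target = discriminative_reward("bright patch", image, (0, 0, 2, 2), toy_score)
off_target = discriminative_reward("bright patch", image, (2, 2, 4, 4), toy_score)
print(on_target, off_target)  # 1.0 0.0
```

In this toy setup a box on the bright patch earns the reward and a box on the dark corner does not. A real scorer would compare the caption text with image content, which `toy_score` does not attempt.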

Paper Structure (18 sections, 6 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Comparison between Seg-Zero's unfocused reasoning (green), which leads to lower performance, and DPAD's focused reasoning (blue), which is guided by Discriminative Perception to achieve substantial improvements.
  • Figure 2: An overview of the DPAD framework, in which an MLLM generates a reasoning chain ($T$), a geometric localization ($A$), and an anchored descriptive caption ($C$). The core of our method is the Discriminative Perception reward, which uses the generated caption to derive a Region of Interest (ROI) Score ($S_1$) and an All of Image (AOI) Score ($S_2$). The contrast between these scores provides a binary reward signal that incentivizes the model to generate focused reasoning capable of distinguishing the target from its context.
  • Figure 3: Qualitative comparison between Seg-Zero and our DPAD. The figure shows examples where Seg-Zero produces "Long & Unfocused" reasoning chains that often stray into irrelevant context before identifying the target. In contrast, DPAD generates "Short & Focused" chains that are more converged and efficient. This demonstrates how DPAD's Discriminative Perception leads to more precise reasoning and a significant reduction in token count.
  • Figure 4: Per-sample token count comparison on the ReasonSeg test set. DPAD is shown in blue, and Seg-Zero is shown in green.
  • Figure 5: Comparison of the average token counts generated by DPAD and Seg-Zero across five datasets. The lines represent the mean values, while the shaded areas indicate the standard deviation. The results confirm that DPAD achieves accurate reasoning with significantly fewer tokens and lower variance across all benchmarks.
  • ...and 2 more figures