CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, Yaowei Wang

TL;DR

The paper addresses reasoning segmentation by bridging language reasoning and pixel-level grounding through a differentiable positional prior heatmap. CoPRS introduces a learnable concentration token that, following Multi-modal Chain-of-Thought (MCoT) reasoning, yields a dense heatmap; this heatmap conditions a lightweight decoder that produces precise masks. The method combines language-driven reinforcement learning (GRPO) with segmentation supervision, reports strong results on the RefCOCO series and ReasonSeg, and supports interpretability through heatmap visualizations and correlation analyses between CoT quality, heatmaps, and masks. Overall, CoPRS points to a promising direction for perception that tightly couples instruction-following reasoning with segmentation fidelity.

Abstract

Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above the prior state of the art across both validation and test partitions. Extensive experiments demonstrate a strong positive correlation among the CoT trajectory, the generated heatmap, and the decoded mask, supporting an interpretable alignment between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and in more precise mask prediction. Code, checkpoints and logs are released at https://github.com/ZhenyuLU-Heliodore/CoPRS.git.
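The abstract describes a concentration token that aggregates image and reasoning features into a heatmap-shaped positional prior. A minimal sketch of one way such a prior could be computed is given below. It is not the authors' implementation: the scaled dot-product scoring, the sigmoid normalization, and the names `positional_prior`, `concentration_token`, `image_keys`, and `grid_hw` are all assumptions made for illustration.

```python
import numpy as np


def positional_prior(concentration_token, image_keys, grid_hw):
    """Turn one query vector into a dense heatmap over image patches.

    concentration_token: (d,) query vector (hypothetical name).
    image_keys: (h * w, d) one key vector per image patch (hypothetical name).
    grid_hw: (h, w) patch grid used to reshape the scores.
    """
    h, w = grid_hw
    d = concentration_token.shape[0]
    # Scaled dot-product score per patch (assumed scoring rule).
    logits = image_keys @ concentration_token / np.sqrt(d)
    # Sigmoid keeps every cell in (0, 1) (assumed normalization).
    heat = 1.0 / (1.0 + np.exp(-logits))
    return heat.reshape(h, w)


# Toy inputs, randomly generated purely for illustration.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4 * 6, 8))
token = rng.normal(size=8)
prior = positional_prior(token, keys, (4, 6))
```

Every operation here is differentiable, which is the property the abstract emphasizes: gradients from a mask loss can flow back through the heatmap to the token that produced it.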

Paper Structure

This paper contains 22 sections, 12 equations, 9 figures, 7 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of paradigms for reasoning segmentation. (a) is exemplified by LISA [lisa_2024], and (b) by Seg-Zero [segzero_2025]. Our CoPRS (c) bridges MCoT reasoning to segmentation through a differentiable and interpretable positional prior.
  • Figure 2: Overall architecture. Given image and text inputs, the policy generates CoT and concentration tokens, which query image keys to generate a positional prior, which is then decoded into masks. The policy and segmentation modules are jointly optimized.
  • Figure 3: Correlation analysis between the positional prior $\bm{H}_{\text{prior}}$ and the predicted mask $\hat{\bm{M}}$ during training and inference on RefCOCO(+/g) and ReasonSeg. Each blue point represents one training batch, while each red point represents one inference instance. Ordinary least squares (OLS) regression lines and mean confidence bands are overlaid.
  • Figure 3: CoT consistency. Consistency score ranges with sample counts, mean heatmap IoU, and mean mask IoU on RefCOCO+.
  • Figure 4: Correlation between CoT quality and segmentation quality (Heatmap/Mask IoU) on RefCOCO+. OLS results are overlaid.
  • ...and 4 more figures
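
The captions above refer to heatmap IoU, mask IoU, and OLS regression lines. The sketch below shows how such quantities could be computed in principle. It is an illustration only, not the paper's evaluation code: the 0.5 binarization threshold and the helper names `iou`, `heatmap_iou`, and `ols_fit` are assumptions.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks empty: treated as a perfect match (a convention chosen here).
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def heatmap_iou(heatmap, target, threshold=0.5):
    """IoU after binarizing a heatmap; the threshold value is an assumption."""
    return iou(np.asarray(heatmap) >= threshold, target)


def ols_fit(x, y):
    """Ordinary least squares line y = slope * x + intercept, plus Pearson r."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)
```

Given per-instance heatmap IoU values as `x` and mask IoU values as `y`, `ols_fit(x, y)` returns the line that would be overlaid on a scatter plot, along with the correlation coefficient.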