CORE-Seg: Reasoning-Driven Segmentation for Complex Lesions via Reinforcement Learning

Yuxin Xie, Yuming Chen, Yishan Yang, Yi Zhou, Tao Zhou, Zhen Zhao, Jiacheng Liu, Huazhu Fu

TL;DR

This paper introduces ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation, and proposes CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter.

Abstract

Medical image segmentation is undergoing a paradigm shift from conventional visual pattern matching to cognitive reasoning analysis. Although Multimodal Large Language Models (MLLMs) have shown promise in integrating linguistic and visual knowledge, significant gaps remain: existing general MLLMs possess broad common sense but lack the specialized visual reasoning required for complex lesions, whereas traditional segmentation models excel at pixel-level segmentation but lack logical interpretability. In this paper, we introduce ComLesion-14K, the first diverse Chain-of-Thought (CoT) benchmark for reasoning-driven complex lesion segmentation. To accomplish this task, we propose CORE-Seg, an end-to-end framework integrating reasoning with segmentation through a Semantic-Guided Prompt Adapter. We design a progressive training strategy from supervised fine-tuning (SFT) to Group Relative Policy Optimization (GRPO), equipped with an adaptive dual-granularity reward mechanism to mitigate reward sparsity. Our method achieves state-of-the-art results with a mean Dice of 37.06% (14.89% higher than the second-best baseline), while reducing the failure rate to 18.42%. Project Page: https://xyxl024.github.io/CORE-Seg.github.io/
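The abstract reports results as a mean Dice score. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient, 2|P ∩ T| / (|P| + |T|), averaged over a set of cases. The function names, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration; they are not taken from the paper's evaluation code.

```python
def dice(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


def mean_dice(pairs):
    """Average Dice over an iterable of (prediction, ground truth) mask pairs."""
    scores = [dice(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    target = [[1, 0], [0, 0]]
    print(dice(pred, target))
```

In this toy case the prediction covers two pixels, the ground truth covers one, and they overlap on one, so the score is 2·1 / (2 + 1).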

Paper Structure (37 sections, 8 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: (a) Three major challenges in complex medical lesion segmentation: noisy and distorted imaging, lesion diversity, and pathological heterogeneity. (b) Supervised fine-tuned, end-to-end dense prediction architecture. (c) Reinforcement learning-driven, cascaded reasoning framework. (d) Our proposed end-to-end complex-lesion-centric reasoning segmentation framework, CORE-Seg.
  • Figure 2: Overview of the ComLesion-14K dataset construction and composition. (a) The dataset construction pipeline, including difficulty-aware filtering to select complex lesion samples, VQA and CoT generation for structured descriptions and reasoning, and automated quality assurance to ensure annotation consistency and reliability. (b) Dataset statistics summarizing the distribution of imaging modalities, anatomical structures, and disease categories. (c) Representative samples illustrating broad organ coverage across diverse imaging modalities.
  • Figure 3: Overview of the CORE-Seg framework. The architecture integrates an MLLM with a SAM via a Semantic-Guided Prompt Adapter. The training consists of two stages: (1) CoT-Based Semantic Alignment Stage: Supervised fine-tuning establishes the mapping between clinical reasoning and lesion localization using the <seg> token. (2) RL-Based Reasoning Exploration and Segmentation Refinement Stage: The model is further optimized using GRPO. A multi-dimensional Reward Engine (evaluating format, detection, and segmentation) serves as the critic, where a Dual-Granularity Mask Reward penalizes disconnected regions and encourages precise boundary delineation.
  • Figure 4: Comparison of mDice scores across 8 medical modalities. CORE-Seg generalizes well across modalities, achieving state-of-the-art performance particularly in MRI, CT, X-Ray, and Ultrasound.
  • Figure 5: Qualitative results of our two-stage method and baselines. Stage 2 refines Stage 1 with more image-grounded reasoning, improving mask quality and robustness, especially for small lesions and imperfect box localization.
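
The Figure 3 caption describes a Reward Engine that scores format, detection, and segmentation and feeds GRPO. The sketch below illustrates only the generic GRPO step of turning a group of sampled-response rewards into group-relative advantages. The weighted-sum combination, the weights, the function names, and the use of the population standard deviation are all assumptions for illustration; the paper's actual reward definitions, including the Dual-Granularity Mask Reward, are not reproduced here.

```python
import math


def total_reward(format_r, detection_r, segmentation_r, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; the weighted sum and weights are hypothetical."""
    w_f, w_d, w_s = weights
    return w_f * format_r + w_d * detection_r + w_s * segmentation_r


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)  # population std; implementations differ on this choice
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Three hypothetical responses sampled for the same image and question.
    group = [
        total_reward(1.0, 0.2, 0.1),
        total_reward(1.0, 0.6, 0.4),
        total_reward(0.0, 0.0, 0.0),
    ]
    print(group_relative_advantages(group))
```

Responses that score above the group mean receive positive advantages and those below receive negative ones, so the policy update needs no separate learned value model. When every response in a group earns the same reward, all advantages are zero and the group gives no learning signal, which is one way reward sparsity can show up.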