MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu

TL;DR

This work defines MEMR-Seg, a task for multi-round, entity-level reasoning in medical image segmentation, and introduces MR-MedSeg, a large-scale dataset of 177K multi-round dialogues built from SA-Med2D-20M with GPT-5 augmentation. It proposes MediRound, a baseline model that fuses prior-round masks and dialogue history via an extended LLM–vision pipeline, augmented by a lightweight Judgment & Correction Mechanism to curb error propagation across rounds. Empirical results show MediRound outperforms traditional medical referring segmentation methods and SegLLM-style baselines in multi-round settings, while maintaining strong single-round performance. The work highlights the practical potential of interactive, reasoning-driven segmentation in clinical workflows and provides a foundation for future research in cross-round medical vision–language interaction.

Abstract

Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

Paper Structure

This paper contains 11 sections, 2 equations, 7 figures, 4 tables, and 1 algorithm.

Figures (7)

  • Figure 1: A demo dialogue of our proposed MediRound. Our model can comprehend user queries that refer to the mask results from previous rounds (e.g., the Round 2 query refers to the Round 1 mask result), enabling cross-round entity-level reasoning in multi-round medical conversations. In contrast, conventional text-prompt-based medical segmentation methods struggle with this complex task.
  • Figure 2: Overview of MR-MedSeg. Our dataset comprises five types of medical reasoning dialogues, each characterized by a specific form of inter-instance relationship, encompassing nearly all multi-round interaction scenarios encountered in real-world medical applications.
  • Figure 3: Semi-automatic pipeline for constructing the MR-MedSeg dataset. The process includes three stages: entity selection, relationship generation, and template integration. The pipeline is primarily driven by manual annotation and complemented by GPT-5–based generation.
  • Figure 4: Overview of the MediRound framework. The figure illustrates the model’s workflow in processing the fourth-round conversation, referring to the second-round mask. The LLaVA-Med encoder $\mathcal{G}_v^{enc}$ consists of the vision encoder and visual projection layer from LLaVA-Med.
  • Figure 5: Illustration of how the Judgment & Correction Mechanism assists MediRound in multi-round reasoning. This mechanism evaluates and optimizes the quality of the [SEG] hidden layer features in each round, effectively preventing current-round errors from being propagated to later dialogues. Notably, the mechanism is not involved in the end-to-end training of MediRound and is only introduced during the evaluation process of the model.
  • ...and 2 more figures
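
The Figure 5 caption describes an inference-time step that checks each round's [SEG] features and repairs them before later rounds can refer to them. The control flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Round`, `run_dialogue`, the `segment`/`judge`/`correct` callables, the scalar quality score, and `threshold` are all hypothetical names introduced here, and the toy stand-ins replace the real model components, whose details this summary does not give.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Feature = List[float]  # stand-in for a round's [SEG] hidden-layer feature


@dataclass
class Round:
    query: str
    ref_round: Optional[int] = None  # index of the earlier round the query refers to


def run_dialogue(
    rounds: Sequence[Round],
    segment: Callable[[str, Optional[Feature]], Feature],
    judge: Callable[[Feature], float],
    correct: Callable[[str, Optional[Feature], Feature], Feature],
    threshold: float = 0.5,
) -> List[Feature]:
    """Run rounds in order, judging and (if needed) correcting each feature
    before it is stored, so later rounds only see the checked version."""
    history: List[Feature] = []
    for r in rounds:
        prior = history[r.ref_round] if r.ref_round is not None else None
        feat = segment(r.query, prior)
        if judge(feat) < threshold:  # hypothetical quality gate
            feat = correct(r.query, prior, feat)
        history.append(feat)
    return history


# Toy stand-ins (assumptions for illustration only).
def toy_segment(query: str, prior: Optional[Feature]) -> Feature:
    if query == "noisy":
        return [-1.0]  # simulate a poor first-round feature
    return [1.0 + (prior[0] if prior is not None else 0.0)]


def toy_judge(feat: Feature) -> float:
    return 0.0 if any(x < 0 for x in feat) else 1.0


def toy_correct(query: str, prior: Optional[Feature], feat: Feature) -> Feature:
    return [abs(x) for x in feat]


dialogue = [Round("noisy"), Round("the region next to it", ref_round=0)]
with_check = run_dialogue(dialogue, toy_segment, toy_judge, toy_correct)
no_check = run_dialogue(dialogue, toy_segment, toy_judge, toy_correct, threshold=0.0)
```

In this toy run, setting `threshold=0.0` disables the gate, and the bad first-round feature leaks into the second round. With the gate on, the second round builds on the corrected feature instead. As in the caption, the check happens only at inference and plays no part in training.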