MediSee: Reasoning-based Pixel-level Perception in Medical Images

Qinyue Tong, Ziqian Lu, Jun Liu, Yangming Zheng, Zheming Lu

TL;DR

This work defines Medical Reasoning Segmentation and Detection (MedSD), enabling segmentation and detection from implicit, knowledge-driven medical queries. It introduces the MLMR-SD dataset (over 200K QA pairs and 12,652 image-mask pairs across 109 medical objects) and MediSee, a baseline that fuses multiple candidate token features via Adaptive Democratic Candidate Fusion and uses similarity-map supervision to enhance reasoning. Across MLMR-SD and traditional SA-Med2D benchmarks, MediSee achieves superior segmentation and detection performance while providing textual explanations, addressing interactivity gaps in medical perception. The approach advances interactive, reasoning-based medical image understanding with potential clinical and research impact.

Abstract

Despite remarkable advancements in pixel-level medical image perception, existing methods are either limited to specific tasks or rely heavily on accurate bounding boxes or text labels as input prompts. However, the medical knowledge required to provide such input is a major obstacle for the general public, which greatly reduces the universality of these methods. Rather than supplying this domain-specialized auxiliary information, general users tend to rely on oral queries that require logical reasoning. In this paper, we introduce a novel medical vision task: Medical Reasoning Segmentation and Detection (MedSD), which aims to comprehend implicit queries about medical images and generate the corresponding segmentation mask and bounding box for the target object. To accomplish this task, we first introduce a Multi-perspective, Logic-driven Medical Reasoning Segmentation and Detection (MLMR-SD) dataset, which encompasses a substantial collection of medical entity targets along with their corresponding reasoning. Furthermore, we propose MediSee, an effective baseline model designed for medical reasoning segmentation and detection. The experimental results indicate that the proposed method can effectively address MedSD with implicit colloquial queries and outperforms traditional medical referring segmentation methods.

Paper Structure

This paper contains 15 sections, 5 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: We introduce MediSee, a model that opens the door to medical reasoning image perception, capable of handling cases that demand complex reasoning and domain-specific medical knowledge. Notably, our model can also generate textual explanations for the given queries, enhancing interpretability and interactivity.
  • Figure 2: The data generation pipeline of the MLMR-SD dataset (left) and its pseudocode (right). The figure illustrates the process of generating question-answer pairs for the medical object "lung" in the image. All variables are defined in the pseudocode.
  • Figure 3: Analysis of the data structure in MLMR-SD. (a) word cloud in MLMR-SD; (b) the frequency distribution for each medical object across the question-answer pairs and images.
  • Figure 4: Overview of the MediSee framework and the similarity loss introduced during the additional fine-tuning phase (lower right corner). $h_{c_i}$, $h_{img}$, and $h_{txt}$ denote the candidate token embedding, image embedding, and text embedding, respectively, all derived from the last hidden layer of LLaVA-Med's output. $Sim(\cdot)$ denotes dot-product similarity. $\hat{h}_{img}$ and $\hat{h}_{seg}$ denote the image embedding and the input features for the mask decoder, both obtained from the non-inference query $\hat{x}_{txt}$.
  • Figure 5: Visualizations of different methods. Some mask areas with very weak responses are marked by yellow dashed lines.
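
The Figure 4 caption describes $Sim(\cdot)$ as a dot-product similarity over the embeddings. The sketch below illustrates what such a similarity map could look like, assuming it is taken between one candidate token embedding $h_{c_i}$ and a grid of per-patch image embeddings $h_{img}$. The function names, the row-major patch layout, and the toy values are hypothetical and are not taken from the paper; how the map enters the similarity loss is not shown here.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))


def similarity_map(
    h_c: Vector, h_img: List[Vector], height: int, width: int
) -> List[List[float]]:
    """Dot-product similarity between one token embedding and each patch.

    h_img holds one embedding per image patch in row-major order
    (an assumption made for this illustration).
    """
    if len(h_img) != height * width:
        raise ValueError("expected height * width patch embeddings")
    scores = [dot(h_c, patch) for patch in h_img]
    # Reshape the flat score list into a height x width grid.
    return [scores[r * width:(r + 1) * width] for r in range(height)]


# Toy example: a 3-dimensional embedding space and a 2 x 2 patch grid.
h_c = [1.0, 0.0, 2.0]
h_img = [
    [1.0, 1.0, 1.0],
    [0.0, 5.0, 0.0],
    [2.0, 0.0, -1.0],
    [0.5, 0.5, 0.5],
]
sim = similarity_map(h_c, h_img, height=2, width=2)
print(sim)
```

In this toy grid, patches whose embeddings align with the token embedding receive higher scores, which is the kind of spatial signal a similarity-map supervision term could compare against a ground-truth mask.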