Text-Promptable Propagation for Referring Medical Image Sequence Segmentation

Runtian Yuan, Mohan Chen, Jilan Xu, Ling Zhou, Qingqiu Li, Yuejie Zhang, Rui Feng, Tao Zhang, Shang Gao

TL;DR

Ref-MISS is the task of segmenting anatomical structures in medical image sequences from natural language prompts. The authors introduce Text-Promptable Propagation (TPP), which combines cross-modal referring interaction with Transformer-based triple propagation to track referred objects across $T$ frames conditioned on $N_p$ prompts, yielding masks $\{\hat{m}_t\}_{t=1}^T$. They also curate Ref-MISS-Bench, a large-scale benchmark covering 4 imaging modalities and 20 organs and lesions, with prompts generated by LLMs and validated by radiologists. Experiments show that TPP outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation (RVOS), with strong zero-/one-shot generalization; ablations highlight the value of rich medical prompts and of the propagation mechanism.

Abstract

Referring Medical Image Sequence Segmentation (Ref-MISS) is a novel and challenging task that aims to segment anatomical structures in medical image sequences (e.g., endoscopy, ultrasound, CT, and MRI) based on natural language descriptions. This task holds significant clinical potential and offers a user-friendly advancement in medical imaging interpretation. Existing 2D and 3D segmentation models struggle to explicitly track objects of interest across medical image sequences, and lack support for interactive, text-driven guidance. To address these limitations, we propose Text-Promptable Propagation (TPP), a model designed for referring medical image sequence segmentation. TPP captures the intrinsic relationships among sequential images along with their associated textual descriptions. Specifically, it enables the recognition of referred objects through cross-modal referring interaction, and maintains continuous tracking across the sequence via Transformer-based triple propagation, using text embeddings as queries. To support this task, we curate a large-scale benchmark, Ref-MISS-Bench, which covers 4 imaging modalities and 20 different organs and lesions. Experimental results on this benchmark demonstrate that TPP consistently outperforms state-of-the-art methods in both medical segmentation and referring video object segmentation.
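To make the phrase "using text embeddings as queries" concrete, the following is a minimal NumPy sketch of query-level propagation only, under stated assumptions. It is not the authors' implementation: the single-head attention, the residual update, the dot-product mask head, and the names `cross_attend` and `propagate_queries` are all hypothetical, and box-level and mask-level propagation are omitted. Text embeddings of shape `(N_p, d)` act as queries over per-frame features of shape `(T, H, W, d)`, and the updated queries from one frame are carried into the next.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, pixels):
    """Single-head cross-attention (illustrative, no learned projections).

    queries: (N_p, d) text-derived queries
    pixels:  (H*W, d) flattened visual features of one frame
    returns: (N_p, d) visual context gathered for each query
    """
    d = queries.shape[-1]
    weights = softmax(queries @ pixels.T / np.sqrt(d), axis=-1)  # (N_p, H*W)
    return weights @ pixels


def propagate_queries(frame_feats, text_emb, threshold=0.0):
    """Carry text-initialised queries through a sequence of frames.

    frame_feats: (T, H, W, d) per-frame visual features (assumed precomputed)
    text_emb:    (N_p, d) one embedding per text prompt
    returns:     boolean masks of shape (T, N_p, H, W)
    """
    T, H, W, d = frame_feats.shape
    queries = text_emb  # frame 1 starts from the text embeddings
    masks = []
    for t in range(T):
        pixels = frame_feats[t].reshape(H * W, d)
        # Residual update; the result is reused as the query for frame t+1.
        queries = queries + cross_attend(queries, pixels)
        logits = (queries @ pixels.T).reshape(-1, H, W)  # hypothetical mask head
        masks.append(logits > threshold)
    return np.stack(masks)


# Usage with random stand-in features: T=4 frames, N_p=2 prompts.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8, 16))
text = rng.normal(size=(2, 16))
masks = propagate_queries(feats, text)
print(masks.shape)  # (4, 2, 8, 8)
```

The design point the sketch illustrates is that each prompt yields its own query, so the same sequence can be segmented for different referred structures simply by changing the text.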

Paper Structure

This paper contains 32 sections, 16 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Limitations and motivations. (a) Conventional 2D models do not incorporate temporal context and fail to utilize intrinsic consistencies in medical image sequences. (b) 3D models lack slice-level object representations for modeling continuity. (c) Multi-class segmentation models are limited to predefined classes and cannot use language to specify a particular class. (d) To address these limitations, Referring Medical Image Sequence Segmentation is introduced, offering substantial clinical value. (e) Our TPP leverages medical text prompts to segment referred objects across medical image sequences in both 2D and 3D data.
  • Figure 2: Architecture of our Text-Promptable Propagation for referring medical image sequence segmentation. (a) Overview of TPP. Triple Prop. is short for Triple Propagation. (b) Illustration of Triple Propagation in Transformer decoder, consisting of box-level, mask-level, and query-level propagation.
  • Figure 3: An illustration of focus areas in Ref-MISS-Bench. Each colored block represents a specific organ/lesion class from the corresponding [dataset], along with the number of training and testing cases (images).
  • Figure 4: Ablation studies on text prompts and propagation strategies. Dice scores are provided for full model, without prompt, and without propagation, respectively.
  • Figure 5: Visualization of segmentation results for different structures and modalities. (a) and (b) display the results of left atrium and myocardium in the same MRIs, respectively. (c) and (d) show spleen and liver in the same CT slices, respectively. From (e) to (h), visualizations are: brain tumor in MRI, liver tumor in CT, polyp in endoscopy, and prostate in ultrasound.
  • ...and 1 more figure
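
Figure 4 reports Dice scores. As a reminder of what that metric measures, here is a minimal sketch of the standard Dice coefficient, $2|P \cap G| / (|P| + |G|)$, for binary masks, plus a per-sequence mean. The function names, the convention of returning 1.0 when both masks are empty, and the simple per-frame averaging are assumptions for illustration; the paper's exact evaluation protocol is not stated in this summary.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * intersection / total)


def sequence_dice(pred_seq, target_seq):
    """Mean per-frame Dice over a sequence of masks (assumed averaging scheme)."""
    return float(np.mean([dice_score(p, t) for p, t in zip(pred_seq, target_seq)]))


# One predicted mask with 2 foreground pixels, 1 of which overlaps the target.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
print(dice_score(pred, target))  # 2*1 / (2+1) = 0.666...
```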