MedUniSeg: 2D and 3D Medical Image Segmentation via a Prompt-driven Universal Model

Yiwen Ye, Ziyang Chen, Jianpeng Zhang, Yutong Xie, Yong Xia

TL;DR

MedUniSeg is a prompt-driven universal segmentation model for 2D and 3D multi-task segmentation across diverse modalities and domains. It surpasses advanced self-supervised and supervised pre-trained models on six downstream tasks, establishing itself as a high-quality, highly generalizable pre-trained segmentation model.

Abstract

Universal segmentation models offer significant potential in addressing a wide range of tasks by effectively leveraging discrete annotations. As the scope of tasks and modalities expands, it becomes increasingly important to generate and strategically position task- and modal-specific priors within the universal model. However, existing universal models often overlook the correlations between different priors, and the optimal placement and frequency of these priors remain underexplored. In this paper, we introduce MedUniSeg, a prompt-driven universal segmentation model designed for 2D and 3D multi-task segmentation across diverse modalities and domains. MedUniSeg employs multiple modal-specific prompts alongside a universal task prompt to accurately characterize the modalities and tasks. To generate the related priors, we propose the modal map (MMap) and the fusion and selection (FUSE) modules, which transform modal and task prompts into corresponding priors. These modal and task priors are systematically introduced at the start and end of the encoding process. We evaluate MedUniSeg on a comprehensive multi-modal upstream dataset consisting of 17 sub-datasets. The results demonstrate that MedUniSeg achieves superior multi-task segmentation performance, attaining a 1.2% improvement in the mean Dice score across the 17 upstream tasks compared to nnUNet baselines, while using less than 1/10 of the parameters. For tasks that underperform during the initial multi-task joint training, we freeze MedUniSeg and introduce new modules to re-learn these tasks. This approach yields an enhanced version, MedUniSeg*, which consistently outperforms MedUniSeg across all tasks. Moreover, MedUniSeg surpasses advanced self-supervised and supervised pre-trained models on six downstream tasks, establishing itself as a high-quality, highly generalizable pre-trained segmentation model.
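
The abstract reports results as a mean Dice score over the 17 upstream tasks. As a reminder of what that metric measures, here is a minimal sketch in plain Python. The function names `dice_score` and `mean_dice`, the flattened 0/1 mask representation, and the convention that two empty masks score 1.0 are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P intersect G| / (|P| + |G|) for flattened binary (0/1) masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels/voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(per_task_scores: Sequence[float]) -> float:
    """Unweighted average of per-task Dice scores."""
    if not per_task_scores:
        raise ValueError("need at least one task score")
    return sum(per_task_scores) / len(per_task_scores)


if __name__ == "__main__":
    prediction = [1, 1, 0, 0]
    ground_truth = [1, 0, 0, 0]
    print(dice_score(prediction, ground_truth))
    print(mean_dice([0.5, 1.0]))
```

An unweighted mean treats every task equally regardless of dataset size; whether the paper weights tasks differently is not stated in this summary.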

Paper Structure (29 sections, 3 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: (a) Comparison between the mainstream solution and our solution. The mainstream solution treats both 2D and 3D data as 1D tokens and utilizes a Transformer-based model for processing. In contrast, our solution interprets 2D data as pseudo-3D data and employs a 3D CNN-based model for processing. (b) Performance and parameter comparisons between nnUNet and MedUniSeg* across 17 upstream datasets. To achieve the same tasks, nnUNet requires 17 individual models, comprising 11 3D models and 6 2D models, while our MedUniSeg* needs only a single model.
  • Figure 2: Technical pipeline of our MedUniSeg, including the MMap module, a vision encoder, the FUSE module, and a prompt-driven decoder. For an input image, we identify its modality ID and task ID. Based on these identifiers, the MMap module generates modal-specific priors, while the FUSE module produces task-specific priors. These priors are integrated at the start and end of the encoding process, enabling MedUniSeg to effectively handle multiple modalities and tasks.
  • Figure 3: Schematic representation of MedUniSeg, UniSeg, Multiple Prompts, Universal Prompts, Fixed Prompts, Bottleneck Prompts, and MedUniSeg-T. Multiple Prompts utilizes multiple task-specific and modal-specific prompts. Universal Prompts adopts a universal modal prompt and a universal task prompt. Fixed Prompts initializes its prompts to zero and keeps them unchanged. Bottleneck Prompts incorporates both priors at the bottleneck of the encoder. MedUniSeg-T introduces the task-related prompt at the end of the decoder. The selection and fusion (SEFU) module first selects a modal-specific prompt and then fuses the features with the prompt. The $Sel.$ operation is used to extract the modal-specific prior from the universal prompt generated by the MMap module. Task-related information is highlighted in purple, while modal-related information is highlighted in green.
  • Figure 4: Visualization of segmentation results obtained from UKAN, UMamba, nnUNet, Universal Model, Hermes, DoDNet, CCQ, UniSeg, and MedUniSeg, along with the ground truths (GTs) on seven datasets. Organs are depicted in red, while tumors and lesions are shown in green. Blue rectangles highlight the differences among the models.
  • Figure 5: Visualization of segmentation results obtained from Swin UNETR, BT, UniMiSS, DeSD, Universal Model, Universal Model$\dag$, Hermes, DoDNet, CCQ, UniSeg, and MedUniSeg, along with the ground truths (GTs) on six datasets. Blue rectangles highlight the differences among the models.
  • ...and 1 more figure
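
The Figure 1 caption says that 2D data is interpreted as pseudo-3D data so that a single 3D CNN can process both kinds of input. One simple way to picture this is to wrap a 2D image as a volume with a single slice. The sketch below does that with nested Python lists. The helper names `shape` and `to_pseudo_3d`, and the choice of a depth axis of size 1, are assumptions made for illustration; the summary does not specify how the authors construct the depth axis.

```python
def shape(x):
    """Return the nested-list shape, e.g. (H, W) for a 2D image."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        if not x:
            break
        x = x[0]
    return tuple(dims)


def to_pseudo_3d(image):
    """Wrap a 2D image (H x W) as a single-slice volume (1 x H x W).

    Assumption: the depth axis has size 1. The original paper may
    build its pseudo-3D input differently.
    """
    if len(shape(image)) != 2:
        raise ValueError("expected a 2D image given as a list of rows")
    return [image]


if __name__ == "__main__":
    image_2d = [[0, 1, 2],
                [3, 4, 5]]
    volume = to_pseudo_3d(image_2d)
    print(shape(image_2d), "->", shape(volume))
```

With this layout, 2D and 3D inputs share the same number of axes, so the same 3D model interface can accept both.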