DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis

Numan Saeed, Tausifa Jan Saleem, Fadillah Maani, Muhammad Ridzuan, Hu Wang, Mohammad Yaqub

TL;DR

DuPLUS introduces a hierarchical, text-controlled vision-language framework for universal medical image segmentation and prognosis. By decoupling modality context (T1) and target specification (T2) through FiLM-based conditioning and a dual-prompt encoder, the model generalizes across CT, MRI, and PET datasets and supports prognosis via accelerated fine-tuning with LoRA and EHR integration. Empirical results show state-of-the-art universal segmentation on 8 of 10 datasets and competitive prognosis on HECKTOR (CI = 0.69), with strong qualitative evidence of flexible, on-demand organ targeting and cross-modality adaptability. The approach offers a practical path toward clinically relevant, extensible AI tools for multimodal medical imaging, with code available for reproducibility.
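The FiLM-based conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters predicted from a conditioning vector, here a stand-in for a text-prompt embedding. The embedding values, projection weights, and function names below are all hypothetical and chosen only for illustration; this is not the DuPLUS implementation.

```python
from typing import List

Vector = List[float]


def linear(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: List[Vector], gamma: Vector, beta: Vector) -> List[Vector]:
    """Apply FiLM per channel: out[c] = gamma[c] * features[c] + beta[c]."""
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical 2-dimensional prompt embedding (stand-in for a text encoder output).
prompt_embedding = [1.0, -1.0]

# Hypothetical projection weights producing one (gamma, beta) pair per channel.
w_gamma = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b_gamma = [1.0, 1.0, 1.0]
w_beta = [[0.0, 0.0], [1.0, 0.0], [0.0, -1.0]]
b_beta = [0.0, 0.0, 0.0]

gamma = linear(w_gamma, b_gamma, prompt_embedding)
beta = linear(w_beta, b_beta, prompt_embedding)

# Toy feature map: 3 channels, 2 spatial positions each.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modulated = film(features, gamma, beta)
print(modulated)
```

Because the image features are untouched until the modulation step, changing only the prompt embedding changes `gamma` and `beta` and therefore the output, without altering any image-side weights.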

Abstract

Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing 'universal' approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS generalizes across three imaging modalities and ten anatomically diverse medical datasets, encompassing more than 30 organs and tumor types. It outperforms state-of-the-art task-specific and universal models on 8 out of 10 datasets. We demonstrate the extensibility of its text-controlled architecture by integrating electronic health record (EHR) data for prognosis prediction; on a head and neck cancer dataset, DuPLUS achieves a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52
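For readers unfamiliar with the Concordance Index reported above, the following is a minimal sketch of the standard Harrell's C-index. It assumes right-censored survival data and a model that outputs a risk score where higher means worse prognosis. The toy data and the function name are hypothetical, and the paper may compute the metric with a library implementation whose tie handling differs.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when patient i has the shorter time and an
    observed event (events[i] == 1). It is concordant when patient i also has
    the higher predicted risk; ties in risk count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: survival times, event flags (0 = censored), risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.3, 0.5, 0.1]
ci = concordance_index(times, events, risks)
print(ci)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported CI of 0.69.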

Paper Structure

This paper contains 24 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Architecture of DuPLUS, a multimodal deep learning network controlled by text prompts. This diagram showcases DuPLUS's key components: the dual-prompt mechanism for text control and the FiLM layers for modality adaptation. It also illustrates the model's extensibility to prognosis prediction via a dedicated prediction module.
  • Figure 2: A visualization of the medical imaging datasets used. (a) Distribution of anatomical classes across CT, MRI, and PET modalities. (b) Dataset sizes and per-structure class counts are heavily imbalanced, which is a challenge for robust model training.
  • Figure 3: Controllable, dual prompt-driven organ segmentation across datasets. Each row shows one CT dataset (BCV abdomen; STRUCTSEG OAR thorax). Columns: CT image, four prompted predictions, and Ground Truth. Inference uses two text prompts: (T1) a modality/region context, fixed per row (“A computed tomography of abdomen” for BCV; “A computed tomography of thorax” for STRUCTSEG OAR); and (T2) a target-organ prompt that is changed per column (“A computed tomography of spleen/liver/pancreas/left kidney” in BCV; “left lung/right lung/heart/spinal cord” in OAR). Holding T1 constant and switching only T2 deterministically switches the predicted structure on the same slice, demonstrating fine-grained text control without altering the image or model weights. Colored title fonts indicate the mask color for each organ; “CT image” and “Ground Truth” provide qualitative reference.
  • Figure 4: Cross-modality organ segmentation using modality-aware text prompts. Rows show the same abdominal organs segmented from CT (top, AMOS CT) and MR (bottom, AMOS MR) images. The model uses dual prompts: T1 specifies the modality/region context, while T2 targets specific organs. By adapting T1 to the imaging modality ("computed tomography" vs "magnetic resonance"), the model successfully segments the same anatomical structures across both modalities without modality-specific training.
  • Figure 5: Dual-Prompt Feature Disentanglement Across Medical Imaging Modalities. UMAP projection of bottleneck features under controlled prompt conditioning. $\bullet$ Set A: Fixed target, varying context/modality prompts. $\blacksquare$ Set B: Fixed context, varying target prompts.
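The dual-prompt usage described in the Figure 3 caption (T1 held fixed, T2 switched per target) can be sketched as a small prompt builder. The template "A <modality> of <region or organ>" is inferred from the BCV examples quoted in that caption, the function name `build_prompt_pairs` is hypothetical, and how the model consumes the pairs is not shown here.

```python
from typing import List, Tuple


def build_prompt_pairs(modality: str,
                       region: str,
                       targets: List[str]) -> List[Tuple[str, str]]:
    """Return (T1, T2) pairs: T1 is the shared context, T2 names one target."""
    t1 = f"A {modality} of {region}"
    return [(t1, f"A {modality} of {target}") for target in targets]


# BCV abdomen example from the Figure 3 caption.
pairs = build_prompt_pairs(
    "computed tomography",
    "abdomen",
    ["spleen", "liver", "pancreas", "left kidney"],
)
for t1, t2 in pairs:
    print(t1, "|", t2)
```

Swapping the `modality` argument to "magnetic resonance" yields the kind of modality-aware T1 prompt that the Figure 4 caption describes.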