Medal S: Spatio-Textual Prompt Model for Medical Segmentation

Pengcheng Shi, Jiawei Chen, Jiaqi Liu, Xinglin Zhang, Tao Chen, Lei Li

TL;DR

Medal S addresses the challenge of segmenting 3D medical volumes across diverse modalities by jointly leveraging native-resolution spatial prompts and textual priors in an end-to-end framework. Its core innovations (channel-wise alignment of volumetric prompts with text embeddings, parallel native-resolution spatial prompting, and a lightweight 3D refinement module) enable accurate multi-class segmentation with substantial efficiency gains, including a speedup of more than 10x for 24-class tasks. The approach uses dynamic resampling and a two-stage inference strategy to balance memory, speed, and precision; it covers five modalities and up to 243 classes on BiomedSegFM and offers text-only and hybrid prompting modes. In practice, Medal S improves segmentation accuracy and efficiency over SAT, though it still trails the state-of-the-art BiomedParse-V on some benchmarks; the authors identify robustness on small lesions and ultrasound-specific adjustments as future work.
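
The speedup of more than 10x quoted here and the inference-time reduction of more than 90% quoted in the abstract are the same statement. A short check, writing $T_{\text{seq}}$ and $T_{\text{par}}$ for the sequential and parallel inference times (symbols introduced here for illustration, not taken from the paper):

$$
s = \frac{T_{\text{seq}}}{T_{\text{par}}}, \qquad
r = 1 - \frac{T_{\text{par}}}{T_{\text{seq}}} = 1 - \frac{1}{s}, \qquad
r > 0.9 \iff s > 10 .
$$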

Abstract

We introduce Medal S, a medical segmentation foundation model that supports native-resolution spatial and textual prompts within an end-to-end trainable framework. Unlike text-only methods lacking spatial awareness, Medal S achieves channel-wise alignment between volumetric prompts and text embeddings, mitigating inaccuracies from resolution mismatches. By preserving full 3D context, it efficiently processes multiple native-resolution masks in parallel, enhancing multi-class segmentation performance. A lightweight 3D convolutional module enables precise voxel-space refinement guided by both prompt types, supporting up to 243 classes across CT, MRI, PET, ultrasound, and microscopy modalities in the BiomedSegFM dataset. Medal S offers two prompting modes: a text-only mode, where model predictions serve as spatial prompts for self-refinement without human input, and a hybrid mode, incorporating manual annotations for enhanced flexibility. For 24-class segmentation, parallel spatial prompting reduces inference time by more than 90% compared to sequential prompting. We propose dynamic resampling to address target-patch ratio imbalance, extending SAT and nnU-Net for data augmentation. Furthermore, we develop optimized text preprocessing, a two-stage inference strategy, and post-processing techniques to improve memory efficiency, precision, and inference speed. On the five-modality average on the validation set, Medal S outperforms SAT with a DSC of 75.44 (vs. 69.83), NSD of 77.34 (vs. 71.06), F1 of 38.24 (vs. 24.88), and DSC TP of 65.46 (vs. 46.97). Medal S achieves excellent performance by harmonizing spatial precision with semantic textual guidance, demonstrating superior efficiency and accuracy in multi-class medical segmentation tasks compared to sequential prompt-based approaches. Medal S will be publicly available at https://github.com/yinghemedical/Medal-S.
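
For readers unfamiliar with the headline metric, the following is a minimal sketch of a per-class Dice Similarity Coefficient (DSC) on integer label volumes. It uses the standard definition $2|P \cap G| / (|P| + |G|)$; the function name `dice_per_class`, the toy volume, and the choice to skip classes absent from both volumes are assumptions made for illustration and are not the challenge's evaluation code.

```python
import numpy as np

def dice_per_class(pred, gt, num_classes):
    """Return {class_id: DSC} for two integer label volumes of equal shape.

    Label 0 is treated as background. Classes absent from both volumes
    are skipped (an assumption; evaluation protocols differ on this case).
    """
    scores = {}
    for c in range(1, num_classes + 1):
        p = pred == c
        g = gt == c
        denom = p.sum() + g.sum()
        if denom == 0:
            continue
        scores[c] = 2.0 * np.logical_and(p, g).sum() / denom
    return scores

# Toy 4x4x4 volume: the first two slices are class 1, the last two are class 2.
gt = np.zeros((4, 4, 4), dtype=np.int64)
gt[:2] = 1
gt[2:] = 2

# The prediction mislabels one class-1 slice (16 voxels) as class 2.
pred = gt.copy()
pred[1] = 2

scores = dice_per_class(pred, gt, num_classes=2)
print(scores)  # class 1: 2*16/(16+32) = 0.667, class 2: 2*32/(48+32) = 0.8
```

The reported DSC values (for example 75.44 vs. 69.83) are on a 0 to 100 scale, i.e. this quantity multiplied by 100 and averaged as described by the authors.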

Paper Structure

This paper contains 28 sections, 10 equations, 5 figures, 6 tables, 3 algorithms.

Figures (5)

  • Figure 1: Left: Example renders from the BiomedSegFM challenge dataset (original images and segmentation masks) covering five imaging modalities: CT, MRI, microscopy, PET, and ultrasound. Top-right: Sample text prompts. Bottom-right: Key challenges include (1) multi-modal heterogeneity, (2) multi-class segmentation, and (3) target-patch ratio imbalance, causing spatio-textual misalignment, sequential inference inefficiency, and FP/FN errors. Our solutions: channel-wise prompt alignment (see the query decoder section), parallel spatial prompts (see the query decoder section), and dynamic resampling (see the dynamic resampling section).
  • Figure 2: Medal S framework pipeline. Multi-scale visual features from the image encoder and text embeddings from the text encoder are fused by a query decoder into adapted embeddings. Parallel spatial prompts (simulated, predicted, or annotated) are processed at native resolution and aligned via channel-wise matching, maintaining full fidelity. This achieves a greater than $10\times$ speedup for 24-class segmentation versus sequential processing (see Fig. 4) and supports iterative self-refinement for precise segmentation.
  • Figure 3: Comparison of Medal S predictions with ground truth on the validation set for five modalities. For each modality, we present both successful and failed segmentation results.
  • Figure 4: Efficiency comparison of spatial prompting strategies. (a) Inference runtime and (b) peak GPU memory consumption versus the number of classes. Parallel prompting achieves minimal time complexity with respect to the number of classes, resulting in a greater than $10\times$ speedup for 24-class segmentation over the sequential approach, whose runtime grows substantially. While parallel prompting requires moderately more memory, it remains within practical limits and offers a favorable trade-off for drastic time savings in multi-class scenarios.
  • Figure 5: Qualitative comparison of Medal S with different spatial prompt configurations. From left to right: (a) Input Image, (b) Ground Truth, (c) Medal S without spatial prompts, (d) Medal S with Stage-1 prediction as spatial prompts, and (e) Medal S with GT masks as spatial prompts. Each configuration is visualized in axial, coronal, and sagittal views and in 3D rendering, demonstrating the progressive improvement in segmentation quality with better spatial prompts, particularly in noise reduction, confusion resolution, and continuity enhancement.
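
The contrast between sequential and parallel spatial prompting described in the captions of Figures 2 and 4 can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the array shapes, the names `voxel_feats`, `text_queries`, and `prompts`, and in particular the additive combination rule, which merely stands in for the paper's learned 3D refinement module. The only point shown is that pairing class k's prompt mask with class k's text query by channel index lets all classes be handled in one vectorized call that matches the per-class loop.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, D, H, W = 3, 8, 4, 5, 6  # classes, feature channels, volume shape (illustrative)

voxel_feats = rng.standard_normal((C, D, H, W))  # stand-in for per-voxel visual features
text_queries = rng.standard_normal((K, C))       # stand-in for one adapted text embedding per class
prompts = rng.integers(0, 2, size=(K, D, H, W)).astype(np.float64)  # one binary mask per class

def segment_sequential(feats, queries, masks, weight=1.0):
    """One pass per class: cost grows with the number of classes."""
    outputs = []
    for query, mask in zip(queries, masks):
        logits = np.tensordot(query, feats, axes=([0], [0]))  # (D, H, W)
        # Hypothetical prompt fusion: bias logits up inside the mask, down outside.
        outputs.append(logits + weight * (2.0 * mask - 1.0))
    return np.stack(outputs)

def segment_parallel(feats, queries, masks, weight=1.0):
    """All classes at once: channel k of `masks` is matched with query k."""
    logits = np.einsum("kc,cdhw->kdhw", queries, feats)  # (K, D, H, W)
    return logits + weight * (2.0 * masks - 1.0)

seq = segment_sequential(voxel_feats, text_queries, prompts)
par = segment_parallel(voxel_feats, text_queries, prompts)
print(seq.shape, np.allclose(seq, par))
```

In this toy setting the two paths give identical outputs, so the benefit of the parallel form is purely computational. The runtime and memory trade-off actually measured by the authors is the one reported in Figure 4.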