SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, He Yao, Yisong Liu, Min Zhu, Junlong Cheng

TL;DR

SegMoTE is proposed as the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

Abstract

Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) they lack adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without data selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves state-of-the-art (SOTA) performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.

Paper Structure (16 sections, 11 equations, 7 figures, 4 tables)

This paper contains 16 sections, 11 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: SegMoTE vs. Previous Works. The heterogeneous data $X$ is first processed by the encoder $\varepsilon$ to extract the feature representation $f$. (a) Previous methods typically perform full fine-tuning of the mask decoder or parameter-efficient fine-tuning, leading to distribution shift from the pretrained model. (b) SegMoTE introduces a token-level mixture-of-experts mechanism that dynamically selects modality-adaptive expert tokens while keeping the mask decoder frozen. The process is guided by the load balancing loss $L_{balance}$, which is constrained using the squared coefficient of variation ($\mathrm{CV}^2$). This design preserves SAM’s original capability, enhances modality-specific adaptability, and maintains a lightweight architecture.
  • Figure 2: Overview of the proposed SegMoTE framework. SegMoTE extends SAM by introducing a token-level expert routing mechanism to enable adaptive multimodal medical image segmentation. The frozen SAM encoder extracts modality-agnostic embedding representations, while the progressive prompt tokenization transforms latent features into semantically aligned feature tokens. These tokens interact with the decoder layers and MoTE for dynamic expert selection and adaptive token updates.
  • Figure 3: Architecture of the Progressive Prompt Tokenization. By randomly selecting mask and text prompts as foreground priors, the learnable query $Q$ captures the relationship between the foreground and background by performing attention on the normalized image features.
  • Figure 4: Feature distribution comparison across datasets. Feature embeddings extracted from the frozen SAM encoder are visualized after dimensionality reduction. The red boxes highlight regions where features overlap and transitions are not smooth. Compared with other datasets, MedSeg-HQ presents smoother and more continuous feature distributions, indicating higher consistency.
  • Figure 5: In-domain segmentation results across datasets. (a) and (b) show the Dice coefficient comparisons under single-click and bounding-box interactions, respectively. (c) illustrates the training dataset sizes used by different methods, where SegMoTE achieves performance improvement by optimizing the annotation quality of the data.
  • ...and 2 more figures
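
The Figure 1 caption mentions a load balancing loss constrained by the squared coefficient of variation ($\mathrm{CV}^2$). The sketch below illustrates one common way such a loss is computed for token-level expert routing: each token's router logits are turned into top-k gates, the gates are summed per expert into an "importance" vector, and $\mathrm{CV}^2 = \mathrm{Var}/\mathrm{Mean}^2$ of that vector is the penalty. This is an assumption-laden illustration and not the authors' implementation: the function names (`top_k_gates`, `cv_squared`, `balance_loss`), the top-k gating rule, the default `k=2`, and the use of summed gate values as the balanced quantity are all hypothetical choices that this page does not confirm.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's expert logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """Keep the k largest softmax gates for one token, renormalise them
    to sum to 1, and zero out the remaining experts (assumed gating rule)."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def cv_squared(values, eps=1e-10):
    """Squared coefficient of variation: population variance / mean^2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var / (mean * mean + eps)


def balance_loss(router_logits, k=2):
    """router_logits holds one list of expert logits per token.
    Returns CV^2 of the per-expert importance (summed gate values)."""
    gates = [top_k_gates(row, k) for row in router_logits]
    num_experts = len(router_logits[0])
    importance = [sum(g[e] for g in gates) for e in range(num_experts)]
    return cv_squared(importance)


if __name__ == "__main__":
    # Four tokens, four experts: each token prefers a different expert.
    balanced = [[5, 0, 0, 0], [0, 5, 0, 0], [0, 0, 5, 0], [0, 0, 0, 5]]
    # Four tokens that all prefer expert 0 (routing collapse).
    collapsed = [[5, 0, 0, 0]] * 4
    print(balance_loss(balanced, k=1))
    print(balance_loss(collapsed, k=1))
```

With top-1 routing, the evenly spread example gives a loss of 0, while the collapsed example gives a loss of about 3, so minimising this term pushes the router away from sending every token to the same expert.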