MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation

Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu, Kun Han, Pooya Khosravi, Xiaohui Xie

TL;DR

This work tackles medical image segmentation (MIS) under heterogeneous, partially labeled multi-dataset settings by introducing MoME, a Mixture of Visual Language Medical Experts. MoME combines a vision branch with multi-scale decoder-layer experts, a text branch that provides CLIP-based semantic embeddings, and a text-guided router that performs pixel-wise fusion across scales, producing class-specific segmentation heads. The model is trained on 10 public CT datasets (3,410 scans) and demonstrates state-of-the-art Dice scores, strong tumor-detection metrics, and robust generalization to external datasets and unseen organs. The results suggest that integrating MoE with vision-language cues yields robust segmentation with broad transferability and lays the groundwork for further extensions, such as treating whole models as additional experts.
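
To make the routing idea concrete, the following is a minimal NumPy sketch of pixel-wise, text-conditioned fusion over multi-scale expert outputs. It is not the authors' implementation. The linear router, the nearest-neighbour upsampling, the way the text embedding is concatenated to per-pixel features, the 2D (rather than volumetric) feature maps, and all names (`text_guided_fusion`, `w_router`, the toy shapes) are assumptions chosen for illustration, and a random vector stands in for a CLIP embedding.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def text_guided_fusion(expert_feats, text_emb, w_router):
    """Fuse multi-scale expert features with per-pixel gates (hypothetical design).

    expert_feats: list of E arrays shaped (C, H_i, W_i); the first is the finest
                  scale and every other scale divides it by an integer factor.
    text_emb:     (D,) text embedding for one class (stand-in for a CLIP vector).
    w_router:     (E, C + D) weights of an assumed linear router.
    Returns the fused map (C, H, W) and the gates (E, H, W).
    """
    _, H, W = expert_feats[0].shape
    # Bring every expert output to the finest resolution: (E, C, H, W).
    ups = np.stack([upsample_nearest(f, H // f.shape[1]) for f in expert_feats])
    # Router input per pixel: mean expert feature concatenated with the text embedding.
    pixel_ctx = ups.mean(axis=0)                                   # (C, H, W)
    text_map = np.broadcast_to(text_emb[:, None, None],
                               (text_emb.shape[0], H, W))          # (D, H, W)
    router_in = np.concatenate([pixel_ctx, text_map], axis=0)      # (C + D, H, W)
    # One logit per expert per pixel, normalised over the expert axis.
    logits = np.einsum("ek,khw->ehw", w_router, router_in)         # (E, H, W)
    gates = softmax(logits, axis=0)
    # Pixel-wise convex combination of the expert outputs.
    fused = (gates[:, None] * ups).sum(axis=0)                     # (C, H, W)
    return fused, gates

# Toy usage: three experts at 8x8, 4x4 and 2x2 with 4 channels each.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 8, 8)),
         rng.normal(size=(4, 4, 4)),
         rng.normal(size=(4, 2, 2))]
text_emb = rng.normal(size=(16,))
w_router = rng.normal(size=(3, 4 + 16))
fused, gates = text_guided_fusion(feats, text_emb, w_router)
```

Because the gates are a softmax over the expert axis, each pixel receives a convex combination of the expert outputs, so different regions of the same scan can lean on different scales. A class-specific head would then map `fused` to a mask for the class described by `text_emb`; that step is omitted here.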

Abstract

In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
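
For readers unfamiliar with the Dice score named in the TL;DR as the headline metric, a minimal sketch of the standard definition on binary masks, Dice = 2|A ∩ B| / (|A| + |B|), is given below. The function name `dice_score` and the smoothing term `eps` are illustrative assumptions; the paper's exact evaluation protocol (per-class averaging, handling of missing labels) is not described in this summary.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks of the same shape.

    eps is a small smoothing constant (an assumption) so that two empty
    masks score 1.0 instead of dividing by zero.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    denom = pred.sum() + target.sum()
    return (2.0 * intersection + eps) / (denom + eps)

# Toy usage: the prediction covers two pixels, one of which is correct.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = dice_score(pred, target)  # 2 * 1 / (2 + 1), roughly 0.667
```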

Paper Structure

This paper contains 19 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: The architecture of MoME employs a text branch to generate CLIP embeddings for organs and tumors. Subsequently, a gating network is employed to intelligently direct the inputs from multiple experts, enabling effective integration of diverse visual features.
  • Figure 2: Qualitative analysis of various methods on the BTCV dataset: Column (a) displays the ground truth. Columns (b), (c), and (e) display results from earlier techniques, while (d) and (f) show MoME's performance. Orange rectangles highlight our model's effectiveness.