MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation
Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu, Kun Han, Pooya Khosravi, Xiaohui Xie
TL;DR
This work tackles medical image segmentation (MIS) in heterogeneous, partially labeled multi-dataset settings by introducing MoME, a Mixture of Visual Language Medical Experts. MoME combines a vision branch with multi-scale decoder-layer experts, a text branch that provides CLIP-based semantic embeddings, and a text-guided router that performs pixel-wise fusion across scales, producing class-specific segmentation heads. The model is trained on 10 public CT datasets (3,410 scans) and achieves state-of-the-art Dice scores, strong tumor-detection metrics, and robust generalization to external datasets and unseen organs. The results suggest that combining MoE with vision-language cues yields robust, broadly transferable segmentation and lays the groundwork for further extensions, such as treating whole models as additional experts.
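To make the idea of text-guided, pixel-wise fusion concrete, here is a minimal NumPy sketch. It is not the authors' router. The summary states only that a text-guided router fuses multi-scale expert outputs per pixel. Everything else is an assumption made for illustration: the function name `text_guided_fusion`, the projection matrix `w_route`, dot-product scoring, dense softmax gating over experts, expert feature maps already resampled to a common resolution, and random arrays standing in for real CLIP embeddings.

```python
import numpy as np


def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_guided_fusion(expert_feats, text_emb, w_route):
    """Fuse expert feature maps with per-pixel gates conditioned on text.

    expert_feats: (E, C, H, W) feature maps from E experts, assumed to be
                  resampled to a shared resolution beforehand.
    text_emb:     (D,) text embedding for one class (a stand-in for CLIP).
    w_route:      (D, C) hypothetical projection from text to feature space.
    Returns the fused map (C, H, W) and the gates (E, H, W).
    """
    query = text_emb @ w_route                               # (C,)
    # Score every expert at every pixel against the text query.
    logits = np.einsum("echw,c->ehw", expert_feats, query)   # (E, H, W)
    gates = softmax(logits, axis=0)                          # sums to 1 over experts
    fused = (gates[:, None] * expert_feats).sum(axis=0)      # (C, H, W)
    return fused, gates


# Toy inputs: random values in place of real decoder features and embeddings.
rng = np.random.default_rng(0)
E, C, H, W, D = 4, 8, 16, 16, 32
expert_feats = rng.normal(size=(E, C, H, W))
text_emb = rng.normal(size=D)
w_route = rng.normal(size=(D, C)) / np.sqrt(D)

fused, gates = text_guided_fusion(expert_feats, text_emb, w_route)
print(fused.shape, gates.shape)
```

In this sketch each pixel receives its own convex combination of expert features, so different regions of a scan can lean on different scales. A sparse top-k gate, common in MoE layers for language models, would be a drop-in alternative to the dense softmax used here.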
Abstract
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), to medical vision-language tasks. The architecture enables dynamic expert selection by using multi-scale visual features tailored to the intricacies of medical imagery and enriched with textual embeddings, a novel integration of vision-language models for this domain. Using an assembly of 10 datasets encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, building on the established efficacy of MoE in boosting model performance and on the incorporation of textual information. With competitive precision across multiple datasets, MoME offers a new architecture for achieving robust results in medical image analysis.
