Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models

Songtao Jiang, Tuo Zheng, Yan Zhang, Yeying Jin, Li Yuan, Zuozhu Liu

TL;DR

Comprehensive experiments on both open- and closed-end medical question answering and image classification tasks, across datasets such as VQA-RAD, SLAKE and Path-VQA, demonstrate that the proposed Med-MoE model achieves performance superior to or on par with state-of-the-art baselines while requiring only approximately 30%-50% of the activated model parameters.

Abstract

Recent general-purpose and domain-specific multimodal large language models (LLMs) have made remarkable progress in medical decision-making. However, they are designed for specific classification or generative tasks, and they require training or fine-tuning on large-scale datasets with large parameter counts and substantial compute, which hinders their clinical utility in resource-constrained settings. In this paper, we propose Med-MoE (Mixture-of-Experts), a novel and lightweight framework that tackles both discriminative and generative multimodal medical tasks. Med-MoE is trained in three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we enable the model to handle different multimodal medical tasks through instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further supported by a meta expert. Comprehensive experiments on both open- and closed-end medical question answering (Med-VQA) and image classification tasks, across datasets such as VQA-RAD, SLAKE and Path-VQA, demonstrate that our model achieves performance superior to or on par with state-of-the-art baselines while requiring only approximately 30%-50% of the activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
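
The abstract describes a router that selectively activates domain-specific experts, with a meta expert that supports them. The following is a minimal sketch of that idea, not the authors' implementation. It assumes top-k routing over softmax scores, an always-active meta expert whose output is simply added, and randomly initialized linear experts; the names `SketchMoELayer`, `top_k`, and `meta_expert` are hypothetical and do not come from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_linear(rng, out_dim, in_dim):
    """Random weights standing in for a trained linear layer."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


class SketchMoELayer:
    """Toy MoE layer: top-k domain experts plus an always-on meta expert.

    Assumption: the paper's abstract does not specify how outputs are
    combined; this sketch sums the weighted expert outputs with the
    meta expert's output.
    """

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_linear(rng, num_experts, dim)
        self.experts = [make_linear(rng, dim, dim) for _ in range(num_experts)]
        self.meta_expert = make_linear(rng, dim, dim)

    def forward(self, token):
        # Router assigns one probability per domain expert.
        probs = softmax(matvec(self.router, token))
        # Keep only the top-k experts; the rest stay inactive for this token.
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:self.top_k]
        norm = sum(probs[i] for i in chosen)
        # The meta expert processes every token regardless of routing.
        out = matvec(self.meta_expert, token)
        for i in chosen:
            weight = probs[i] / norm
            expert_out = matvec(self.experts[i], token)
            out = [o + weight * e for o, e in zip(out, expert_out)]
        return out, chosen


layer = SketchMoELayer(dim=4, num_experts=4, top_k=2)
output, active = layer.forward([1.0, 0.5, -0.5, 2.0])
print("active experts:", active)
```

Because only `top_k` of the experts run per token, the number of activated parameters is smaller than the total parameter count, which is the property the abstract's efficiency claim refers to.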

Paper Structure (14 sections, 7 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Upper: This figure showcases our model's capability in addressing three primary types of Medical VQA challenges and image classification tasks. Lower: Comparison between Med-MoE and LLaVA-Med, emphasizing Med-MoE's advantages in inference speed, model size, and its superior performance.
  • Figure 2: The framework of Med-MoE with three phases.
  • Figure 3: Visualization of task embeddings and performance using routers under varied settings. A higher silhouette score (sil. score) denotes better task differentiation. The supplementary figure labeled fig:router_sup illustrates Phi2's embeddings.
  • Figure 4: Visualization of expert specialization in processing image and text tokens under the MRI modality. Results for other modalities are in the supplementary figure labeled fig:other_image_token.
  • Figure 5: Upper: Expert activations for four modalities handled by our router of Med-MoE in each MoE layer. Lower: Expert activations for four modalities handled by the standard learned router in each MoE layer.
  • ...and 6 more figures
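
The caption of Figure 3 uses the silhouette score to judge how well task embeddings separate. As a reference for readers, here is a minimal sketch of the standard silhouette computation with Euclidean distance. The toy 2-D points and labels are made up for illustration and are not data from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over all points; ranges from -1 to 1.

    For each point, a is the mean distance to its own cluster and b is
    the smallest mean distance to any other cluster. The point's
    silhouette is (b - a) / max(a, b).
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster contributes a silhouette of 0
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Toy embeddings: two tight groups that are far apart.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
well_separated = silhouette_score(toy_points, [0, 0, 1, 1])
mixed_up = silhouette_score(toy_points, [0, 1, 0, 1])
print(well_separated > mixed_up)
```

Labels that match the true grouping give a score near 1, while labels that cut across the groups give a low or negative score, which is why a higher value indicates better task differentiation.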