MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

Leyang Shen, Gongwei Chen, Rui Shao, Weili Guan, Liqiang Nie

TL;DR

This work addresses the gap between generalist multimodal LLMs and specialist models, which it attributes to task interference across vision-language (VL) tasks. It introduces MoME, a dual-mixture architecture comprising MoVE for vision and MoLE for language. MoVE uses an Adaptive Deformable Transformation to harmonize heterogeneous vision features, and MoLE uses sparsely gated adapters to route language processing. The model shows task-dependent specialization through instance-level routing for vision and sentence-embedding-guided routing for language, and it is validated on 24 VL datasets with gains over baselines. The result is a framework that improves generalist VL understanding at comparable computation cost; the code is publicly released.

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across various vision-language (VL) tasks. However, a generalist MLLM typically underperforms a specialist MLLM on most VL tasks, which can be attributed to task interference. In this paper, we propose a mixture of multimodal experts (MoME) to mitigate task interference and obtain a generalist MLLM. MoME is composed of two key components: a mixture of vision experts (MoVE) and a mixture of language experts (MoLE). MoVE adaptively modulates the features transformed from various vision encoders and is highly compatible across transformation architectures. MoLE incorporates sparsely gated experts into LLMs, improving performance at roughly unchanged inference cost. To counter task interference, MoME specializes in both the vision and language modalities to adapt to task discrepancies. Extensive experiments show that MoME significantly improves the performance of generalist MLLMs across various VL tasks. The source code is released at https://github.com/JiuTian-VL/MoME
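
As a rough illustration of the "sparsely gated experts" idea behind MoLE, the sketch below routes a whole sentence to one bottleneck adapter and adds its output to an existing FFN output. It is not the authors' implementation: the top-1 gating, the parallel residual placement, the adapter shape, and every name (`Adapter`, `MoLEBlock`, `route`) are assumptions made for illustration, and the weights are random.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = rand_matrix(bottleneck, dim, rng)
        self.up = rand_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in matvec(self.down, x)]
        return matvec(self.up, hidden)


class MoLEBlock:
    """Assumed design: one router per block, top-1 expert per sentence."""

    def __init__(self, dim, embed_dim, num_experts, bottleneck, rng):
        self.router = rand_matrix(num_experts, embed_dim, rng)
        self.experts = [Adapter(dim, bottleneck, rng) for _ in range(num_experts)]

    def route(self, sentence_embedding):
        # The routing decision depends only on the sentence embedding,
        # so every token in the sentence uses the same expert.
        probs = softmax(matvec(self.router, sentence_embedding))
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    def __call__(self, token_states, ffn_outputs, sentence_embedding):
        idx, gate = self.route(sentence_embedding)
        expert = self.experts[idx]  # only this adapter is evaluated
        mixed = []
        for x, f in zip(token_states, ffn_outputs):
            a = expert(x)
            mixed.append([fi + gate * ai for fi, ai in zip(f, a)])
        return mixed, idx


rng = random.Random(0)
block = MoLEBlock(dim=8, embed_dim=4, num_experts=4, bottleneck=2, rng=rng)
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(3)]
ffn_out = [[0.0] * 8 for _ in tokens]  # stand-in for the frozen FFN output
sentence = [rng.gauss(0.0, 1.0) for _ in range(4)]
mixed, expert_idx = block(tokens, ffn_out, sentence)
print("selected expert:", expert_idx)
```

Because only the selected adapter runs, the extra compute per token is one small adapter regardless of how many experts exist, which is one way to read the abstract's "roughly unchanged inference cost".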

Paper Structure (25 sections, 7 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: VL data distribution visualization and model performance comparisons. Experimental results in (a) show that a generalist model trained on a mixed dataset underperforms most specialist models trained on separate task groups. The feature distributions visualized in (b) and (c) show significant discrepancies across VL tasks in both images and instructions.
  • Figure 2: The overall architecture of the proposed MoME. The model obtains compressed and self-enhanced visual features from distinct vision encoders through adaptive deformable transformation (a) and aggregates them by dynamic routing (b). The MoLE blocks (c) are integrated into each FFN layer of the LLM to improve multitasking capability at little cost.
  • Figure 3: Comparison of MLLMs with different vision encoders.
  • Figure 4: Distribution of vision experts routing results. In each bar, the lengths of different colors represent the frequency with which each expert is selected.
  • Figure 5: Distribution of language experts routing results. The figures depict the expert load conditions of four selected datasets. In each bar, the lengths of different colors represent the frequency with which each expert is selected.
  • ...and 3 more figures
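
The dynamic routing step described in the Figure 2 caption (aggregating features from several vision encoders) can be sketched as a soft, per-instance weighted sum. This is a minimal sketch under stated assumptions: the features are taken to be already transformed to a common shape (the adaptive deformable transformation is not implemented here), the router is a single linear layer over a hypothetical `instance_feature` vector, and the function names are illustrative rather than taken from the released code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_vision_experts(router_weights, instance_feature):
    """Return one soft weight per vision expert for this instance.

    router_weights: one row of linear weights per expert (assumed form).
    instance_feature: vector summarizing the current sample (assumed input).
    """
    logits = [
        sum(w * v for w, v in zip(row, instance_feature))
        for row in router_weights
    ]
    return softmax(logits)


def aggregate(expert_features, weights):
    """Weighted sum of per-expert features, each shaped [tokens][dim]."""
    num_tokens = len(expert_features[0])
    dim = len(expert_features[0][0])
    return [
        [
            sum(w * feats[t][d] for w, feats in zip(weights, expert_features))
            for d in range(dim)
        ]
        for t in range(num_tokens)
    ]


# Three hypothetical vision experts, each giving 2 tokens of dimension 3.
expert_features = [
    [[1.0] * 3 for _ in range(2)],
    [[2.0] * 3 for _ in range(2)],
    [[3.0] * 3 for _ in range(2)],
]
# An all-zero router yields uniform weights, so the fused features are
# the plain average of the three experts.
zero_router = [[0.0, 0.0] for _ in range(3)]
weights = route_vision_experts(zero_router, [0.5, -0.5])
fused = aggregate(expert_features, weights)
```

With a trained router the weights would differ from sample to sample, which is the kind of per-dataset variation that the routing distributions in Figure 4 report.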