MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models
Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei
TL;DR
This work tackles the computational burden of large vision-language models by introducing MoIIE, a sparse Mixture-of-Experts architecture that jointly models modality-specific (intra-modality) features and cross-modal (inter-modality) interactions using three expert groups. A two-stage training strategy first aligns the visual and linguistic backbones and then jointly fine-tunes all components, enabling effective activation of both multimodal and MoE capabilities. Empirical results across 13 benchmarks and multiple backbones show that MoIIE consistently surpasses dense models and modality-only MoE variants, with substantial gains on knowledge-based QA and hallucination tasks, while using fewer activated parameters than competing open-source MoE-LVLMs. The approach yields cost-efficient LVLMs that retain strong multimodal reasoning and is broadly compatible with existing LLM backbones.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs, which motivates the exploration of sparse Mixture-of-Experts (MoE) architectures. While MoE improves parameter efficiency, applying it effectively to model modality-specific features and cross-modal associations simultaneously in LVLMs remains challenging. In this work, we propose to incorporate a Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency, and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source multi-modal models built on MoE-LLMs that use more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.
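The modality-guided routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `moiie_route`, the expert layout (text experts, then image experts, then shared experts), the top-2 selection, and the softmax over the selected logits are all assumptions made for illustration. The only property taken from the abstract is that each token may be sent to experts of its own modality or to the shared inter-modality pool, never to the other modality's experts.

```python
import math


def moiie_route(router_logits, is_image, n_text, n_image, n_shared, top_k=2):
    """Pick top-k experts per token from its own modality group plus the shared group.

    Assumed (hypothetical) expert layout: [text experts | image experts | shared experts].
    Returns, for each token, a list of (expert_index, gate_weight) pairs.
    """
    n_experts = n_text + n_image + n_shared
    shared = list(range(n_text + n_image, n_experts))
    routes = []
    for logits, image_token in zip(router_logits, is_image):
        if len(logits) != n_experts:
            raise ValueError("expected one router logit per expert")
        # Intra-modality experts depend on the token's modality.
        if image_token:
            intra = list(range(n_text, n_text + n_image))
        else:
            intra = list(range(n_text))
        # Experts of the other modality are never candidates.
        allowed = intra + shared
        chosen = sorted(allowed, key=lambda e: logits[e], reverse=True)[:top_k]
        # Numerically stable softmax over the selected logits only.
        peak = max(logits[e] for e in chosen)
        weights = [math.exp(logits[e] - peak) for e in chosen]
        total = sum(weights)
        routes.append([(e, w / total) for e, w in zip(chosen, weights)])
    return routes


# Two toy tokens with six experts: 2 text, 2 image, 2 shared.
toy_logits = [
    [2.0, 0.1, 5.0, 5.0, 1.0, 0.3],  # text token
    [4.0, 4.0, 0.2, 1.5, 0.1, 2.5],  # image token
]
toy_routes = moiie_route(toy_logits, [False, True], n_text=2, n_image=2, n_shared=2)
```

In this toy input the text token has its largest logits on the image experts (indices 2 and 3), yet it is routed to experts 0 and 4, one text expert and one shared expert. The image token likewise ignores the high-scoring text experts and selects experts 5 and 3.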
