MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei

TL;DR

This work tackles the computational burden of large vision-language models by introducing MoIIE, a sparse Mixture-of-Experts architecture that jointly models modality-specific (intra-modality) features and cross-modal (inter-modality) interactions using three expert groups. A two-stage training strategy aligns the visual and linguistic backbones and then jointly fine-tunes all components, enabling effective activation of both multimodal and MoE capabilities. Empirical results across 13 benchmarks and multiple backbones show that MoIIE consistently surpasses dense models and modality-only MoE variants, with substantial gains on knowledge-based QA and hallucination tasks, while using fewer activated parameters than competing open-source MoE-LVLMs. The proposed approach offers scalable, cost-efficient LVLMs that retain strong multimodal reasoning, and it demonstrates broad compatibility with existing LLM backbones, making it practically impactful for scalable multimodal AI systems.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs, which motivates the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improves parameter efficiency, effectively applying it to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbones demonstrate the effectiveness, efficiency, and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source multi-modal models based on MoE-LLMs that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.

Paper Structure

This paper contains 30 sections, 4 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Modality-specific MoE vs. our MoIIE. Left: The modality-specific MoE module routes text and image tokens exclusively to their respective specialized expert groups, limiting cross-modal associations such as the alignment between the "dog" token and its corresponding image region. Right: Our MoIIE introduces intra-modality and inter-modality expert groups. Intra-modality experts (Expert V for image tokens, Expert L for text tokens) process modality-specific features, while inter-modality experts (Expert S) process tokens from both modalities to model cross-modal interactions.
  • Figure 2: Method overview. Left: The MoIIE architecture, consisting of intra-modality experts (Expert V for image tokens and Expert L for text tokens) and inter-modality experts (Expert S) that process tokens from both modalities. Right: The two-stage training strategy. For each module, the tunable or frozen icon before the slash indicates the configuration during the first stage, while the icon after the slash represents the second-stage setup.
  • Figure 3: Performance variation across different visual instruction tuning data scales. MoIIE outperforms other architectural variants, achieving superior scaling efficiency.
  • Figure 4: Visualization of expert activation pathways. The figure shows the top-2 activated experts for text and image tokens, where Expert V and Expert L are intra-modality experts and Expert S denotes the inter-modality experts.
  • Figure 5: Comparison between the vanilla MoE and our MoIIE. Left: The vanilla MoE module (Mixtral) routes tokens of all modalities into a single group of experts. Right: The MoIIE module introduces intra-modality and inter-modality expert groups. Intra-modality experts (Expert V for image tokens, Expert L for text tokens) model modality-specific features, while inter-modality experts (Expert S) process tokens from both modalities to model cross-modal associations. The visual router routes image tokens to Expert V and Expert S, while the textual router routes text tokens to Expert L and Expert S.
  • ...and 1 more figures
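
The modality-guided routing described in the abstract and in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes two experts per group, a hidden size of 4, one linear router per modality, and Mixtral-style gating (top-2 selection over the combined intra-modality and shared candidates, followed by a softmax over the selected logits). The names `candidate_experts`, `make_router`, and `route` are hypothetical, and the page does not say whether the top-2 choice is made over the combined pool or separately per group.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration (not taken from the paper).
GROUP_SIZES = {"V": 2, "L": 2, "S": 2}
INTRA_GROUP = {"image": "V", "text": "L"}  # modality -> intra-modality group
HIDDEN_DIM = 4
TOP_K = 2


def candidate_experts(modality):
    """Names of the experts a token of this modality is allowed to reach."""
    intra = INTRA_GROUP[modality]
    # Intra-modality experts first, then the shared inter-modality experts (S).
    return [f"{g}{i}" for g in (intra, "S") for i in range(GROUP_SIZES[g])]


def make_router(modality, rng):
    """One linear router per modality: a weight vector per candidate expert."""
    return {name: [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]
            for name in candidate_experts(modality)}


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route(hidden, router, top_k=TOP_K):
    """Return [(expert_name, gate)] for the top_k experts of one token."""
    logits = {name: sum(w * h for w, h in zip(weights, hidden))
              for name, weights in router.items()}
    chosen = sorted(logits, key=logits.get, reverse=True)[:top_k]
    # Assumption: gates are renormalised over the selected experts only.
    gates = softmax([logits[name] for name in chosen])
    return list(zip(chosen, gates))


rng = random.Random(0)
routers = {m: make_router(m, rng) for m in ("image", "text")}
tokens = [("image", [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]),
          ("text", [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)])]
for modality, hidden in tokens:
    print(modality, route(hidden, routers[modality]))
```

Because each router only scores its own intra-modality group plus the shared group, an image token can never reach Expert L and a text token can never reach Expert V, while both can meet in Expert S. That shared group is what distinguishes this layout from the modality-specific MoE in Figure 1.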