CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen

TL;DR

CuMo tackles the high computational cost of scaling multimodal LLMs by injecting sparse Top-K MoE blocks into the vision encoder and MLP connector, rather than expanding the LLM alone. It uses a three-stage training pipeline with co-upcycling: MoE experts are initialized from pre-trained MLP blocks, and auxiliary losses keep the load across experts balanced. Trained exclusively on open-source datasets, CuMo achieves state-of-the-art or competitive results within each model-size group on VQA and visual-instruction-following benchmarks, supporting its vision-side scaling approach. Code and model weights are open-sourced to enable broader adoption.

Abstract

Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks. However, these scaling approaches are computationally expensive and overlook the significance of improving model capabilities from the vision side. Inspired by the successful applications of Mixture-of-Experts (MoE) in LLMs, which improves model scalability during training while keeping inference costs similar to those of smaller models, we propose CuMo. CuMo incorporates Co-upcycled Top-K sparsely-gated Mixture-of-experts blocks into both the vision encoder and the MLP connector, thereby enhancing the multimodal LLMs with minimal additional activated parameters during inference. CuMo first pre-trains the MLP blocks and then initializes each expert in the MoE block from the pre-trained MLP block during the visual instruction tuning stage. Auxiliary losses are used to ensure a balanced loading of experts. CuMo outperforms state-of-the-art multimodal LLMs across various VQA and visual-instruction-following benchmarks using models within each model size group, all while training exclusively on open-sourced datasets. The code and model weights for CuMo are open-sourced at https://github.com/SHI-Labs/CuMo.
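
The co-upcycling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the ReLU activation, the tiny dimensions, the random weights, and the helper names (mlp, upcycle, top_k_moe) are all assumptions made for illustration, as is the choice to take the softmax over only the K selected router logits. Under those assumptions, an MoE block whose experts are all copies of one pre-trained MLP reproduces that MLP's output at initialization, which is one reason upcycling gives a stable starting point.

```python
import copy
import math
import random


def mlp(params, x):
    """Two-layer MLP with ReLU (a simplification); params = (W1, W2) as lists of rows."""
    w1, w2 = params
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def upcycle(pretrained, num_experts):
    """Initialize every expert as an independent copy of the pre-trained MLP."""
    return [copy.deepcopy(pretrained) for _ in range(num_experts)]


def top_k_moe(experts, router, x, k):
    """Route x to the k highest-scoring experts and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router]
    chosen = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: softmax is normalized over the k selected logits only.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    weights = [e / sum(exps) for e in exps]
    outputs = [mlp(experts[i], x) for i in chosen]
    mixed = [
        sum(wt * out[d] for wt, out in zip(weights, outputs))
        for d in range(len(outputs[0]))
    ]
    return mixed, chosen


# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
dim, hidden_dim, num_experts, k = 4, 8, 4, 2
pretrained = (
    [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(hidden_dim)],
    [[rng.gauss(0, 0.5) for _ in range(hidden_dim)] for _ in range(dim)],
)
router = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(num_experts)]

experts = upcycle(pretrained, num_experts)
x = [rng.gauss(0, 1) for _ in range(dim)]
moe_out, chosen = top_k_moe(experts, router, x, k)
dense_out = mlp(pretrained, x)
```

Only K of the experts run for each token, so the activated parameter count stays close to that of the dense MLP even as the total number of experts grows. The experts diverge from one another only once training updates them separately.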

Paper Structure (19 sections, 5 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Comparisons of CuMo Mistral-7B with state-of-the-art 7B multimodal LLMs. CuMo outperforms strong open-sourced models such as Mini-Gemini and LLaVA-NeXT, as well as the private MM1 model.
  • Figure 2: Architecture of CuMo. CuMo incorporates sparse Top-K MoE blocks into the CLIP vision encoder and vision-language MLP connector, thereby improving the multimodal LLM capabilities from the vision side. Skip connections are omitted for simplicity. Further implementation details are provided in Section 3.2.
  • Figure 3: Initialization of MoE blocks via Co-Upcycling. Each MLP expert within the MoE block during the visual instruction tuning stage is initialized from the corresponding pre-trained MLP.
  • Figure 4: Training Stages of CuMo. The first stage involves pre-training the MLP for better alignment. Subsequently, the pre-finetuning stage trains all parameters as a warm-up before the next stage. Finally, the MLP experts within each MoE block are initialized from the weights of the corresponding MLP block, followed by training all parameters in the visual instruction tuning stage.
  • Figure 5: Expert distributions of MoE blocks in CLIP. We select layers from CLIP and summarize the activated experts during the feed-forward process on the MME test set.
  • ...and 2 more figures
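
The expert distributions summarized in Figure 5 amount to counting how often each expert is selected. The sketch below shows that count together with one common form of auxiliary balance loss. It is an illustration under assumptions: the Switch-Transformer-style formula N * sum_i(f_i * P_i) is a widely used choice, not something this summary confirms for CuMo, and the routing assignments and router probabilities are made-up toy values.

```python
from collections import Counter


def expert_distribution(assignments, num_experts):
    """Fraction of routing slots that went to each expert.

    assignments holds one list of selected expert indices per token.
    """
    counts = Counter(i for chosen in assignments for i in chosen)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(num_experts)]


def load_balance_loss(fractions, mean_probs):
    """Assumed Switch-Transformer-style term: N * sum_i f_i * P_i.

    fractions[i] is the share of slots routed to expert i, and
    mean_probs[i] is the mean router probability assigned to expert i.
    """
    n = len(fractions)
    return n * sum(f * p for f, p in zip(fractions, mean_probs))


# Toy Top-2 routing decisions for four tokens and four experts (made up).
balanced = [[0, 1], [2, 3], [0, 2], [1, 3]]
skewed = [[0, 1], [0, 1], [0, 1], [0, 2]]

balanced_fracs = expert_distribution(balanced, 4)
skewed_fracs = expert_distribution(skewed, 4)

# Made-up mean router probabilities matching each scenario.
balanced_loss = load_balance_loss(balanced_fracs, [0.25, 0.25, 0.25, 0.25])
skewed_loss = load_balance_loss(skewed_fracs, [0.55, 0.35, 0.05, 0.05])
```

With perfectly uniform routing the term equals 1, and it grows as both the routed share and the router probability concentrate on a few experts, so minimizing it pushes the router toward an even spread.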