
MoExtend: Tuning New Experts for Modality and Task Extension

Shanshan Zhong, Shanghua Gao, Zhongzhan Huang, Wushao Wen, Marinka Zitnik, Pan Zhou

TL;DR

MoExtend tackles the challenge of extending large language models to new modalities without catastrophic forgetting by introducing a three-stage pipeline that aligns a vision encoder to text, selectively extends MoE layers with new experts, and fine-tunes only the extension parts along with calibration modules. By freezing the pretrained MoE model and vision encoder, it achieves rapid adaptation (roughly 6x faster than full fine-tuning) while matching larger models on vision-language benchmarks with far fewer active parameters. The approach demonstrates strong image understanding and multimodal capabilities across standard benchmarks, with ablations showing that adding experts to roughly half of the layers and using data-driven layer selection yields competitive results and controlled forgetting. This work offers a practical, cost-effective path to scalable multimodal LLMs and opens avenues for extending MoE-based models to diverse modalities beyond vision. The code is publicly available, enabling reproducibility and further exploration of multimodal MoE architectures.

Abstract

Large language models (LLMs) excel in various tasks but are primarily trained on text data, limiting their application scope. Expanding LLM capabilities to include vision-language understanding is vital, yet training them on multimodal data from scratch is challenging and costly. Existing instruction tuning methods, e.g., LLaVA, often connect a pretrained CLIP vision encoder to an LLM and fully fine-tune the LLM to bridge the modality gap. However, full fine-tuning is plagued by catastrophic forgetting, i.e., forgetting previous knowledge, and by high training costs, particularly in an era of increasing tasks and modalities. To solve this issue, we introduce MoExtend, an effective framework designed to streamline the modality adaptation and extension of Mixture-of-Experts (MoE) models. MoExtend seamlessly integrates new experts into pretrained MoE models, endowing them with novel knowledge without the need to tune pretrained models such as the MoE model and vision encoders. This approach enables rapid adaptation and extension to new modal data or tasks, effectively addressing the challenge of accommodating new modalities within LLMs. Furthermore, MoExtend avoids tuning pretrained models, thus mitigating the risk of catastrophic forgetting. Experimental results demonstrate the efficacy and efficiency of MoExtend in enhancing the multimodal capabilities of LLMs, contributing to advancements in multimodal AI research. Code: https://github.com/zhongshsh/MoExtend.

Paper Structure (16 sections, 10 equations, 5 figures, 6 tables)

This paper contains 16 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: MoExtend consists of three stages: (a) Alignment Stage: we add a trainable MLP to the pretrained vision encoder and tune the added MLP on image-caption data to achieve modal alignment; (b) Extension Stage: we determine which MoE layers need extension using an Extender; (c) Fine-tuning Stage: we fine-tune the added extension part on a given instruction dataset while keeping all other parameters frozen. "Other layer" denotes the neural network components other than the MoE layer, including normalisation, the self-attention layer, etc.
  • Figure 2: (Left) Original MoE layer; (Right) The extension part includes an additional expert FFN$_{m+1}$ and a corresponding column of trainable matrix parameters in the Router. Each expert is equipped with a learnable lightweight calibration module to correct gate weights altered due to the increased number of experts.
  • Figure 3: Left: standard deviation $d_i$ of each layer, calculated by Eq. (\ref{eq:std}). Layers in orange (layer IDs: 3, 4, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 20, 21, 26, 28) receive new experts, while layers in blue do not. Right: loss of MoExtend when the new experts are placed in different positions. With our position selection scheme, we achieve faster convergence than with other manually designed schemes.
  • Figure 4: Distribution of expert selection per layer under different router initialization methods. We randomly select 10,000 multimodal samples from LLaVA 1.5-mix-665k as inputs and count the number of times each expert at each layer is selected. To simplify the visualization, we compute and plot the selection proportion of each of the five experts.
  • Figure 5: Structure of different types of calibration modules. The green modules represent calibration modules, and $m$ is the number of experts. The output of the calibration module acts on the softmax output of the router to correct the shift in the probability distribution caused by the change in the number of experts, ensuring proper gate weight adjustments for each expert.
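
The extension described in the captions of Figures 2 and 5 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the names `route` and `extend_router`, the top-2 selection, and the choice of a per-expert scalar that multiplies the softmax output are all assumptions made for illustration. The captions only state that the router gains one column for the new expert and that a lightweight calibration module acts on the router's softmax output.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    mx = max(logits)
    exps = [math.exp(z - mx) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_logits(x, router_cols):
    """One logit per expert: dot product of the token with that expert's router column."""
    return [sum(w * xi for w, xi in zip(col, x)) for col in router_cols]


def route(x, router_cols, calibration, top_k=2):
    """Return (expert_index, gate_weight) pairs for one token.

    Assumption: calibration is one scalar per expert that rescales the
    softmax probability; the paper's actual calibration modules may differ.
    """
    probs = softmax(router_logits(x, router_cols))
    gates = [c * p for c, p in zip(calibration, probs)]
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    return [(i, gates[i]) for i in top]


def extend_router(router_cols, calibration, new_col):
    """Append one router column and one calibration scalar (initialised to 1.0).

    The original columns are reused unchanged, mirroring the idea that
    pretrained parameters stay frozen and only the extension is trained.
    """
    return router_cols + [list(new_col)], calibration + [1.0]


random.seed(0)
d, m = 4, 4  # hidden size and number of original experts (toy values)
router_cols = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
calibration = [1.0] * m
x = [random.gauss(0, 1) for _ in range(d)]

ext_cols, ext_cal = extend_router(router_cols, calibration, [0.0] * d)

before = softmax(router_logits(x, router_cols))
after = softmax(router_logits(x, ext_cols))
print("original expert probabilities:", before)
print("after adding expert m+1:     ", after[:m])
print("routing after extension:     ", route(x, ext_cols, ext_cal))
```

Because the softmax denominator grows when a logit is added, every original expert's probability shrinks after the extension even though its router column is unchanged. This is the gate-weight shift that the calibration module is meant to correct.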