Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao
TL;DR
The paper tackles task interference in multimodal large language models (MLLMs), which grows as more modalities and tasks are introduced. It introduces Octavius, a framework that combines Mixture-of-Experts with LoRA (LoRA-MoE) and instance-based gating to route knowledge to task- and modality-specific experts, paired with modality encoders for images and 3D point clouds. The approach yields gains of about 20% across diverse 2D and 3D tasks while keeping parameter overhead low. The work demonstrates improved robustness to interference in multimodal instruction tuning and provides a scalable path toward embodied AI applications with richer perceptual inputs.
Abstract
Recent studies have demonstrated that Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, however, conflicts and interference among them can increasingly degrade performance. Because this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) with LoRA, a representative parameter-efficient fine-tuning (PEFT) technique, to design a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, this is one of the pioneering efforts to introduce MoE into MLLMs to address this problem. The experimental results (about 20% improvement) demonstrate the effectiveness and versatility of our design on various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/tutorial/.
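To make the LoRA-MoE idea concrete, the following is a minimal NumPy sketch of one linear layer whose frozen weight is augmented by several LoRA experts, mixed by a gate that scores a single instance-level embedding. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRAMoELinear`, the top-k softmax gate, the `alpha / rank` scaling, the zero initialization of `B`, and all dimensions are hypothetical choices made for this example.

```python
import numpy as np


class LoRAMoELinear:
    """Frozen linear layer plus a gated mixture of LoRA experts (illustrative only)."""

    def __init__(self, d_in, d_out, num_experts=4, rank=8, alpha=16.0,
                 top_k=2, d_gate=None, seed=0):
        rng = np.random.default_rng(seed)
        d_gate = d_in if d_gate is None else d_gate
        # Frozen pretrained weight (random here as a stand-in).
        self.W = rng.normal(0.0, 0.02, (d_out, d_in))
        # One low-rank pair (A_i, B_i) per expert; B starts at zero so the
        # layer initially behaves exactly like the frozen layer (assumption).
        self.A = rng.normal(0.0, 0.02, (num_experts, rank, d_in))
        self.B = np.zeros((num_experts, d_out, rank))
        # Gate: one score per expert, computed from an instance-level embedding.
        self.W_gate = rng.normal(0.0, 0.02, (num_experts, d_gate))
        self.scale = alpha / rank
        self.top_k = top_k

    def gate(self, instance_emb):
        """Return sparse expert weights: softmax over the top-k gate logits."""
        logits = self.W_gate @ instance_emb            # (num_experts,)
        top = np.argsort(logits)[-self.top_k:]         # indices of the k largest
        weights = np.zeros_like(logits)
        exp = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
        weights[top] = exp / exp.sum()
        return weights

    def forward(self, x, instance_emb):
        """x: (seq_len, d_in); instance_emb: (d_gate,). Returns (seq_len, d_out)."""
        g = self.gate(instance_emb)
        out = x @ self.W.T                             # frozen path
        for i in np.flatnonzero(g):                    # only the selected experts
            out = out + g[i] * self.scale * (x @ self.A[i].T) @ self.B[i].T
        return out


layer = LoRAMoELinear(d_in=16, d_out=8, num_experts=4, rank=2, top_k=2)
tokens = np.ones((3, 16))                 # three token features for one instance
instance = np.arange(16, dtype=float)     # stand-in for an instance embedding
print(layer.gate(instance))               # two non-zero weights that sum to 1
print(layer.forward(tokens, instance).shape)
```

Because the gate is evaluated once per instance rather than once per token, every token of the same input shares the same expert mixture. In this sketch, that sharing is what lets different tasks or modalities settle on different experts instead of competing for a single set of LoRA weights.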
