Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao

TL;DR

The paper tackles task interference in multimodal large language models (MLLMs), which grows as more modalities and tasks are introduced. It introduces Octavius, a framework that combines Mixture-of-Experts with LoRA (LoRA-MoE) and instance-based gating to route knowledge to task- and modality-specific experts, paired with modality encoders for images and 3D point clouds. The approach yields performance gains of about 20% across diverse 2D and 3D tasks while keeping parameter overhead low. The work demonstrates improved robustness to interference in multimodal instruction tuning and provides a scalable path toward embodied AI applications with richer perceptual inputs.

Abstract

Recent studies have demonstrated Large Language Models (LLMs) can extend their zero-shot generalization capabilities to multimodal learning through instruction tuning. As more modalities and downstream tasks are introduced, negative conflicts and interference may have a worse impact on performance. While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs). Specifically, we combine the well-known Mixture-of-Experts (MoE) and one of the representative PEFT techniques, i.e., LoRA, designing a novel LLM-based decoder, called LoRA-MoE, for multimodal learning. To the best of our knowledge, we are one of the pioneering efforts to introduce MoE into MLLMs to address this problem. The experimental results (about 20% improvement) have shown the effectiveness and versatility of our design in various 2D and 3D downstream tasks. Code and datasets are available at https://openlamm.github.io/tutorial/.
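
To make the LoRA-MoE idea concrete, here is a minimal pure-Python sketch of a single linear layer with several LoRA experts and an instance-level gate. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoRAMoELinear`, the top-k sparse softmax gate, the zero initialization of the `B` matrices, and all dimensions are assumptions chosen for the example. The gate looks at one instruction embedding per instance and mixes the low-rank updates of the selected experts on top of a frozen base weight.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    top = max(z)
    e = [math.exp(v - top) for v in z]
    total = sum(e)
    return [v / total for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoRAMoELinear:
    """Hypothetical sketch: frozen linear layer plus gated LoRA experts."""

    def __init__(self, d_in, d_out, d_gate, n_experts=4, rank=2, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.W = rand_matrix(d_out, d_in, rng)  # frozen base weight
        # Each expert k holds a low-rank pair (A_k, B_k); B_k starts at zero,
        # so the layer initially behaves exactly like the frozen base layer.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(n_experts)]
        self.B = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]
        self.G = rand_matrix(n_experts, d_gate, rng)  # gate projection

    def gate(self, instr_emb):
        """Return {expert index: weight} for the top-k experts of one instance."""
        logits = matvec(self.G, instr_emb)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        return dict(zip(chosen, weights))

    def forward(self, x, instr_emb):
        """y = W x + sum over selected experts of g_k * B_k (A_k x)."""
        y = matvec(self.W, x)
        for k, g in self.gate(instr_emb).items():
            delta = matvec(self.B[k], matvec(self.A[k], x))
            y = [yi + g * di for yi, di in zip(y, delta)]
        return y


layer = LoRAMoELinear(d_in=4, d_out=3, d_gate=5)
x = [1.0, 0.5, -0.5, 2.0]
instruction = [0.2, -0.1, 0.4, 0.0, 1.0]
gates = layer.gate(instruction)
y = layer.forward(x, instruction)
```

Because the gate is computed once per instance from the instruction, every token of that instance uses the same expert mixture, which is one plausible way to keep conflicting tasks from sharing the same low-rank update.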

Paper Structure

This paper contains 17 sections, 10 equations, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Octavius is a unified, multimodal large language model with a novel capability to comprehend various tasks across different modalities, including but not limited to 2D captioning, 2D detection, 3D VQA, and 3D dense captioning.
  • Figure 2: Overall pipeline of Octavius. We design corresponding encoders for different modalities, with the primary objective of empowering the LLMs to gain a deeper understanding of visual features. Additionally, we propose a dynamic gating network that selects distinct LoRA experts based on input instructions, thereby proficiently mitigating interference arising from multimodal learning.
  • Figure 3: We conduct a simple pilot study on PASCAL VOC and ScienceQA to demonstrate the tug-of-war problem and the effectiveness of our proposed LoRA-MoE. Recall@0.5 denotes recall at an IoU threshold of 0.5.
  • Figure 4: We follow previous works, e.g., LAMM [yin2023lamm] and LLaVA [liu2023llava], to apply an instruction-following training pipeline.
  • Figure 5: Structure of Object-As-Scene. To acquire scene-level features, we follow a three-step process. First, we obtain RoIs from a given point cloud using a pre-trained detector. Next, we pre-train a Point-BERT model following a ULIP-like pipeline and employ it to extract instance-level 3D features. Finally, by aggregating the features from the visual embeddings, we derive the final scene-level feature.
  • ...and 8 more figures
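
The Recall@0.5 metric mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming axis-aligned 2D boxes in `(x1, y1, x2, y2)` form and a simple rule where a ground-truth box counts as recalled if any prediction overlaps it with IoU at or above the threshold. The function names `iou` and `recall_at_iou` are hypothetical, and the paper's exact matching protocol may differ.

```python
def box_area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); zero if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(gt_boxes, pred_boxes, threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one prediction."""
    if not gt_boxes:
        return 0.0
    hits = sum(
        1 for g in gt_boxes if any(iou(g, p) >= threshold for p in pred_boxes)
    )
    return hits / len(gt_boxes)


# Illustrative data only: two ground-truth boxes, one of which is covered.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred = [(0, 0, 10, 8), (50, 50, 60, 60)]
recall = recall_at_iou(gt, pred, threshold=0.5)
```

In this toy data the first prediction overlaps the first ground-truth box with an IoU of 0.8, while the second ground-truth box has no overlapping prediction, so half of the ground-truth boxes are recalled.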