LLMs Can Evolve Continually on Modality for X-Modal Reasoning

Jiazuo Yu, Haomiao Xiong, Lu Zhang, Haiwen Diao, Yunzhi Zhuge, Lanqing Hong, Dong Wang, Huchuan Lu, You He, Long Chen

TL;DR

PathWeave is a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning.

Abstract

Multimodal Large Language Models (MLLMs) have gained significant attention due to their impressive capabilities in multimodal understanding. However, existing methods rely heavily on extensive modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. In this paper, we propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities that enables MLLMs to continually EVolve on modalities for $\mathbb{X}$-modal reasoning. We leverage the concept of Continual Learning and develop an incremental training strategy atop pre-trained MLLMs, enabling their expansion to new modalities using uni-modal data, without joint-modal pretraining. Specifically, we introduce a novel Adapter-in-Adapter (AnA) framework, in which uni-modal and cross-modal adapters are seamlessly integrated to facilitate efficient modality alignment and collaboration. Additionally, an MoE-based gating module is applied between the two types of adapters to further enhance multimodal interaction. To evaluate the proposed method, we establish a challenging benchmark called Continual Learning of Modality (MCL), which consists of high-quality QA data from five distinct modalities: image, video, audio, depth and point cloud. Extensive experiments demonstrate the effectiveness of the proposed AnA framework in terms of learning plasticity and memory stability during continual learning. Furthermore, PathWeave performs comparably to state-of-the-art MLLMs while reducing the parameter training burden by 98.73%. Our code is available at https://github.com/JiazuoYu/PathWeave

Paper Structure

This paper contains 19 sections, 9 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Comparison of different multimodal LLMs: (a) Conventional multimodal methods [han2023onellm, zhao2023chatbridge, chen2023x] require unified sampling across modalities. (b) Our proposed incremental MLLMs learn each modality sequentially without joint-modal datasets.
  • Figure 2: Overall framework of PathWeave. We start from a pretrained vision LLM [panagopoulou2023x] and progressively expand it to new modalities without accessing historical data. Given input samples from modality $m$, we first use a frozen encoder ($E_m$) for feature extraction and leverage Q-Former to achieve multimodal alignment with LLMs. The Adapter-in-Adapter (AnA) module is then implemented in Q-Former to achieve flexible modal-path switching and expansion. In detail, the uni-modal adapters ($\mathcal{A}^m$) are implemented in parallel to facilitate plasticity on the new modality and are frozen once trained, while the cross-modal adapters ($\hat{\mathcal{A}}^m$) are formed by inserting a set of in-adapters ($\{\mathcal{F}_{i}^m\}_{i=1}^{m-1}$) into the learned uni-adapters to enhance collaboration with historical knowledge. Additionally, an MoE-based gating module ($\mathcal{G}^m$) is implemented among the uni-adapters for adaptive multimodal integration in the input space.
  • Figure 3: Ablation study of the $\hat{T}_i^n$ performance for the $n$-th dataset in modality $i$, which benefits from knowledge of different modalities. "Based on I-V-A-D" denotes training the point cloud modality on top of our pre-trained PathWeave, which was trained in the sequence of image, video, audio, and depth.
  • Figure 4: Qualitative results of our method on each modality after continuous training.
  • Figure A5: More qualitative results of our method on each modality after continuous training.
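
The Adapter-in-Adapter design described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class names (`UniAdapter`, `InAdapter`, `AnALayer`), the bottleneck adapter shape, the placement of each in-adapter inside the bottleneck of a frozen uni-adapter, and the softmax gate over adapter paths are all assumptions made for illustration. The sketch only shows the forward pass and the bookkeeping of which parts would be trainable when a new modality is added; the `frozen` flag is a marker, since no training is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class UniAdapter:
    """Bottleneck adapter (assumed shape): down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck):
        self.down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.up = rng.normal(0.0, 0.02, (bottleneck, dim))
        self.frozen = False  # marker only; set once a later modality arrives

    def hidden(self, x):
        return np.maximum(x @ self.down, 0.0)

    def __call__(self, x):
        return self.hidden(x) @ self.up


class InAdapter:
    """Small residual map placed inside a frozen uni-adapter (assumed placement)."""

    def __init__(self, bottleneck):
        self.w = np.zeros((bottleneck, bottleneck))  # zero init: identity at start

    def __call__(self, h):
        return h + h @ self.w


class AnALayer:
    """Hypothetical layer holding one uni-adapter per learned modality."""

    def __init__(self, dim, bottleneck):
        self.dim, self.bottleneck = dim, bottleneck
        self.uni = []          # uni-adapters, in the order modalities were learned
        self.in_adapters = []  # in-adapters for the current (newest) modality
        self.gate = None       # gating weights for the current modality

    def add_modality(self):
        for adapter in self.uni:
            adapter.frozen = True  # earlier uni-adapters are kept fixed
        n_old = len(self.uni)
        self.in_adapters = [InAdapter(self.bottleneck) for _ in range(n_old)]
        self.uni.append(UniAdapter(self.dim, self.bottleneck))
        # One gate logit per path: n_old cross-modal paths plus the new uni path.
        self.gate = np.zeros((self.dim, n_old + 1))

    def gate_weights(self, x):
        return softmax(x @ self.gate)  # shape: (tokens, n_paths)

    def forward(self, x):
        *old, new = self.uni
        # Cross-modal path: frozen down-projection -> in-adapter -> frozen up-projection.
        paths = [f(a.hidden(x)) @ a.up for a, f in zip(old, self.in_adapters)]
        paths.append(new(x))
        w = self.gate_weights(x)
        mixed = sum(w[:, k:k + 1] * p for k, p in enumerate(paths))
        return x + mixed  # residual connection around the adapter mixture

    def trainable_params(self):
        new = self.uni[-1]
        n = new.down.size + new.up.size + self.gate.size
        return n + sum(f.w.size for f in self.in_adapters)


layer = AnALayer(dim=8, bottleneck=4)
for _ in ["image", "video", "audio"]:
    layer.add_modality()

x = rng.normal(size=(5, 8))  # 5 query tokens of width 8
y = layer.forward(x)
print(y.shape, layer.trainable_params())
```

In this sketch, adding a modality creates one new uni-adapter, one in-adapter per previously learned modality, and a fresh gate, which mirrors the caption's index range $i = 1, \dots, m-1$ for the in-adapters. The small trainable-parameter count relative to the full layer is only meant to convey the flavor of the parameter savings; it does not reproduce the 98.73% figure reported in the abstract.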