Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer

Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, Huawen Feng

TL;DR

This work analyzes how MCIT suffers from catastrophic forgetting and negative forward transfer due to cross-task embedding discrepancies. It introduces Fwd-Prompt, a gradient-projection-driven, multimodal prompt-tuning approach that partitions task-specific subspaces and reuses pre-trained knowledge to achieve anti-forgetting and positive forward transfer. Through a multimodal prompt pool and subspace projections, Fwd-Prompt demonstrates state-of-the-art performance with fewer trainable parameters and no rehearsal data across diverse vision-language tasks. The results highlight the practicality and scalability of continual instruction-tuning for MLLMs, suggesting promising directions for future MCIT research.

Abstract

Multimodal Continual Instruction Tuning (MCIT) enables Multimodal Large Language Models (MLLMs) to meet continuously emerging requirements without expensive retraining. MCIT faces two major obstacles: catastrophic forgetting (where old knowledge is forgotten) and negative forward transfer (where the performance of future tasks is degraded). Although existing methods have greatly alleviated catastrophic forgetting, they still suffer from negative forward transfer. By performing singular value decomposition (SVD) on input embeddings, we discover a large discrepancy between the input embeddings of different tasks. This discrepancy causes the model to learn information that is irrelevant to old and pre-trained tasks, leading to catastrophic forgetting and negative forward transfer. To address these issues, we propose Prompt Tuning with Positive Forward Transfer (Fwd-Prompt), a prompt-based method that projects the prompt gradient onto the residual space to minimize interference between tasks and onto the pre-trained subspace to reuse pre-trained knowledge. Our experiments demonstrate that Fwd-Prompt achieves state-of-the-art performance while updating fewer parameters and requiring no old samples. Our research illuminates the potential of continuously adapting MLLMs to new tasks under the instruction tuning paradigm and encourages future studies to explore MCIT.
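The gradient projection mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is a generic orthogonal-projection example, not the paper's exact update rule: the function names, the `energy` threshold used to choose the subspace size, and the synthetic embeddings are all assumptions made for illustration. The idea shown is that an SVD of old-task input embeddings yields a subspace, and a prompt gradient with that subspace removed (the residual-space component) has no first-order effect along the old-task directions.

```python
import numpy as np

def dominant_subspace(embeddings, energy=0.95):
    """Orthonormal basis (as columns) of the top right-singular directions.

    embeddings: array of shape (num_tokens, dim).
    energy: fraction of squared singular values to retain (hypothetical knob).
    """
    _, s, vt = np.linalg.svd(embeddings, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative energy reaches the threshold.
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return vt[:k].T  # shape (dim, k)

def project_onto(grad, basis):
    """Component of grad inside span(basis), e.g. a subspace to be reused."""
    return grad @ basis @ basis.T

def project_to_residual(grad, basis):
    """Component of grad orthogonal to span(basis), i.e. the residual space."""
    return grad - project_onto(grad, basis)

# Synthetic stand-in for old-task input embeddings: rank 3 inside 16 dimensions.
rng = np.random.default_rng(0)
old_embeddings = rng.normal(size=(64, 3)) @ rng.normal(size=(3, 16))
basis = dominant_subspace(old_embeddings, energy=1 - 1e-9)

# Hypothetical gradient for 4 prompt tokens of width 16.
prompt_grad = rng.normal(size=(4, 16))
safe_grad = project_to_residual(prompt_grad, basis)

# Largest inner product between any old embedding and the projected gradient.
print(np.abs(old_embeddings @ safe_grad.T).max())
```

In this sketch `project_to_residual` plays the anti-interference role, while `project_onto` would be applied with a basis derived from pre-trained embeddings when the aim is to reuse pre-trained knowledge. How the two projections are combined in Fwd-Prompt is not specified here.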

Paper Structure (28 sections, 17 equations, 10 figures, 12 tables, 1 algorithm)

Figures (10)

  • Figure 1: Catastrophic forgetting and negative forward transfer in multimodal continual instruction tuning. InstructBLIP (Dai et al., 2023) is sequentially instruction-tuned on Flickr30k, VizWiz, TextVQA, and GQA.
  • Figure 2: The contour of the distribution of input embeddings from four different tasks. "Initial" and "DirectIT" represent the input embeddings of InstructBLIP before and after instruction tuning on each task, respectively. (a) Flickr30k. (b) VizWiz. (c) TextVQA. (d) GQA.
  • Figure 3: The rank of input embeddings in five scenarios. "Initial": load InstructBLIP without training; "DirectIT": train InstructBLIP on each of the four datasets separately; "MultiTask": train InstructBLIP on all four datasets jointly; "SEQ": train InstructBLIP on the four datasets sequentially; "ER": store 1% of old instances and train InstructBLIP sequentially using the experience replay strategy. "Initial" is the starting point of MCIT. "DirectIT" and "MultiTask" both serve as upper bounds of MCIT, and "SEQ" is the lower bound. "ER" is a popular technique for alleviating forgetting.
  • Figure 4: An overview of Fwd-Prompt. The multimodal prompt pool and gradient projection details are in Sections \ref{sec:prompt-pool} and \ref{sec:gradient-project}, respectively. Illustrations of the multimodal prompt pool and gradient projection are provided in Figures \ref{fig:method_joint_similarity} and \ref{fig:method_gradient_projection}, respectively.
  • Figure 5: An illustration of learning a multimodal prompt pool with joint similarity. The key intuition for building a multimodal prompt pool is that both the image and the text instruction should determine which prompt is selected. For example, we expect the model to select different prompts when different text instructions are provided for the same input image.
  • ...and 5 more figures
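
The intuition in the Figure 5 caption, that the same image paired with different text instructions should select different prompts, can be sketched as a key-matching lookup. This is a hypothetical illustration rather than the paper's formulation: scoring each prompt by the sum of an image-key cosine similarity and a text-key cosine similarity is an assumption, as are the names `select_prompts`, `img_keys`, and `txt_keys` and the toy 2-dimensional embeddings.

```python
import numpy as np

def cosine(query, keys):
    """Cosine similarity between one query vector and each row of keys."""
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return k @ q

def select_prompts(img_emb, txt_emb, img_keys, txt_keys, top_k=1):
    """Indices of the top_k prompts under an assumed additive joint similarity."""
    joint = cosine(img_emb, img_keys) + cosine(txt_emb, txt_keys)
    return np.argsort(-joint)[:top_k]

# Toy pool of three prompts, each with one image key and one text key.
img_keys = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
txt_keys = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

image = np.array([1.0, 0.0])    # the same image in both queries
text_a = np.array([1.0, 0.0])   # first instruction (toy embedding)
text_b = np.array([0.0, 1.0])   # second instruction (toy embedding)

print(select_prompts(image, text_a, img_keys, txt_keys))
print(select_prompts(image, text_b, img_keys, txt_keys))
```

Because the image term is identical in both queries, any difference in the selected prompt comes from the text term alone, which is the behavior the caption describes.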