Continual Instruction Tuning for Large Multimodal Models

Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang

TL;DR

This work analyzes continual instruction tuning (CIT) for large multimodal models (LMMs), highlighting persistent forgetting under sequential task learning and the relative strengths of replay-based and model-expansion continual learning (CL) methods. It introduces two CIT benchmarks, explores regularization- and expansion-based strategies, and demonstrates that task similarity can boost anti-forgetting and transfer when incorporated into regularization and expansion approaches. The findings suggest practical guidelines for deploying CIT in evolving vision-language ecosystems and point to task-similarity-aware methods as a key lever for robust continual learning. Overall, the paper advances understanding of how to maintain instruction-following performance in LMMs as new tasks arrive.

Abstract

Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks are constantly being created in practice. Instead of always re-training LMMs when new tasks arrive, continual learning offers flexibility for models to continually and efficiently exploit the evolving data. This work aims to explore the following two questions: 1) Do LMMs still suffer from catastrophic forgetting in continual instruction tuning? 2) Are the existing three classes of continual learning methods still applicable to the continual instruction tuning of LMMs? An extensive study is conducted to address the above questions. First, we establish the first benchmark in this setting and reveal that catastrophic forgetting is still observed when continually instruction-tuning LMMs. However, the multi-task joint instruction tuning can facilitate the model's continual learning ability and mitigate forgetting. Second, we integrate and adapt classic continual learning methods to our context, demonstrating the efficacy of data replay and model expansion strategies across diverse scenarios. In contrast, regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Third, we delve into the correlation and forgetting dynamics between vision-language task pairs and propose task-similarity-informed regularization and model expansion methods for continual instruction tuning of LMMs. Experimental results show that our approach consistently boosts the model's performance.

Paper Structure (31 sections, 10 equations, 6 figures, 9 tables, 2 algorithms)

Figures (6)

  • Figure 1: Analysis of forgetting in naive sequential instruction tuning. Continual instruction tuning starts from BLIP2 in benchmark 1 (Flickr30k$\rightarrow$TextCaps$\rightarrow$VQA v2$\rightarrow$OCR-VQA$\rightarrow$GQA) and InstructBLIP in benchmark 2 (Multi-task$\rightarrow$Flickr30k$\rightarrow$VizWiz$\rightarrow$TextVQA$\rightarrow$GQA). Start denotes the phase in which the model has just been tuned on the task, while end denotes the phase in which the model has finished continual learning on all tasks. A larger gap between the two bar colors indicates more severe forgetting on that task.
  • Figure 2: Illustration of continual instruction tuning benchmarks. We conduct continual instruction tuning on LMMs trained with or without task 0 and aim to explore whether multi-task joint instruction tuning improves the model's continual learning ability as well as the differences in the applicability of the continual learning methods between the two cases.
  • Figure 3: Illustration of task similarity-informed regularization and model expansion methods. Text and image in the instruction tuning data are passed through their corresponding task encoders to get the task embeddings, respectively. Different colors of task embeddings correspond to different tasks. Similarity scores of the new task to all the old tasks are obtained by fusing the similarity regarding the image, text input, and text output. The obtained similarity score can be used for adaptive weighting of parameter importance in regularization-based methods and for the selection or reuse of task-specific modules in model expansion methods.
  • Figure 4: Performance on each known task after the final stage of continual instruction tuning. Different colors represent different methods. Initial indicates the zero-shot performance of the model prior to continual instruction tuning. The numerical values of the results for Initial and DirectFT are labeled in the charts.
  • Figure 5: Examples of catastrophic forgetting. We show the responses given by models trained on different datasets when given a test image from Flickr30k for the image captioning task.
  • ...and 1 more figures
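
The similarity fusion described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine similarity, the weighted-average fusion, the equal default weights, and the names `cosine`, `fused_similarity`, and `most_similar_task` are all assumptions made for illustration. The caption states only that similarities over the image, text input, and text output are fused into one score per old task.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical keys for the three embedding views named in the caption.
VIEWS = ("image", "text_in", "text_out")


def fused_similarity(new_task, old_task, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse per-view similarities into one score.

    Assumption: fusion is a weighted average; the source does not
    specify the fusion rule in this section.
    """
    return sum(
        w * cosine(new_task[view], old_task[view])
        for w, view in zip(weights, VIEWS)
    )


def most_similar_task(new_task, old_tasks):
    """Return (name, score) of the old task closest to the new one.

    Illustrates how a score could drive the selection or reuse of a
    task-specific module in an expansion-style method.
    """
    scores = {name: fused_similarity(new_task, emb) for name, emb in old_tasks.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Toy two-dimensional embeddings, invented for this example.
    old = {
        "captioning": {"image": [1.0, 0.0], "text_in": [1.0, 0.0], "text_out": [1.0, 0.0]},
        "vqa": {"image": [1.0, 0.0], "text_in": [0.0, 1.0], "text_out": [0.0, 1.0]},
    }
    new = {"image": [1.0, 0.0], "text_in": [0.0, 1.0], "text_out": [0.0, 1.0]}
    print(most_similar_task(new, old))
```

In a regularization-style method, the same score could instead scale a per-task penalty so that parameters important to similar old tasks are constrained differently from those of dissimilar ones; how the paper maps the score to a weight is not stated in this section.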