CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model

Cheng Chen, Junchen Zhu, Xu Luo, Hengtao Shen, Lianli Gao, Jingkuan Song

TL;DR

This paper presents Continual Instruction tuNing (CoIN), a comprehensive benchmark for assessing existing MLLMs in the sequential instruction tuning paradigm, and introduces MoELoRA to MLLMs, which is effective at retaining previous instruction alignment.

Abstract

Instruction tuning is a prevalent strategy that Multimodal Large Language Models (MLLMs) use to align with human instructions and adapt to new tasks. Nevertheless, MLLMs face the challenge of adapting to users' evolving knowledge and demands, so how to retain existing skills while acquiring new knowledge needs to be investigated. In this paper, we present a comprehensive benchmark, namely Continual Instruction tuNing (CoIN), to assess existing MLLMs in the sequential instruction tuning paradigm. CoIN comprises 10 commonly used datasets spanning 8 task categories, ensuring a diverse range of instructions and tasks. The trained model is evaluated from two aspects, Instruction Following and General Knowledge, which assess the alignment with human intention and the knowledge preserved for reasoning, respectively. Experiments on CoIN demonstrate that current powerful MLLMs still suffer from catastrophic forgetting, and that the failure in intention alignment, rather than knowledge forgetting, is mainly responsible. To this end, we introduce MoELoRA to MLLMs, which is effective at retaining previous instruction alignment. Experimental results consistently show that this method reduces forgetting on CoIN.
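As an illustration of the comparison the benchmark draws (accuracy on a task right after tuning on it versus accuracy on the same task after the whole sequence has been tuned), the following is a minimal sketch, not the paper's evaluation code. The function name `forgetting_per_task`, the accuracy-matrix layout, the task names, and all numbers are assumptions made up for this example; the paper's own metric definitions may differ.

```python
from typing import Dict, List


def forgetting_per_task(acc: List[List[float]], tasks: List[str]) -> Dict[str, float]:
    """Per-task accuracy drop under sequential tuning (illustrative only).

    acc[i][j] is assumed to hold the accuracy on task j measured right
    after the model has been tuned on the i-th task in the sequence.
    """
    n = len(tasks)
    if len(acc) != n or any(len(row) != n for row in acc):
        raise ValueError("acc must be an n x n matrix matching the task list")
    final = acc[-1]  # accuracies after tuning on the last task
    # Drop = accuracy just after tuning on task j minus accuracy at the end.
    return {tasks[j]: acc[j][j] - final[j] for j in range(n)}


# Hypothetical numbers, invented for this sketch (not results from the paper).
tasks = ["task_a", "task_b", "task_c"]
acc = [
    [80.0, 10.0, 5.0],   # after tuning on task_a
    [60.0, 70.0, 8.0],   # after tuning on task_b
    [50.0, 40.0, 90.0],  # after tuning on task_c
]
drops = forgetting_per_task(acc, tasks)
mean_drop = sum(drops.values()) / len(drops)
print(drops, mean_drop)
```

A larger drop for a task means more of what the model achieved right after tuning on that task was lost by the end of the sequence; the last task always has a drop of zero under this definition.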

Paper Structure (83 sections, 6 equations, 4 figures, 15 tables)

Figures (4)

  • Figure 1: Different behavior of MLLMs when sequentially tuned on CoIN. Blue represents the accuracy for each task evaluated when just tuned on the corresponding task, and Red represents the accuracy evaluated after the models have been sequentially tuned on all tasks. LLaVA [liu2024visual] and Qwen-VL [DBLP:journals/corr/abs-2308-12966] suffer from catastrophic forgetting while MiniGPT-v2 [DBLP:journals/corr/abs-2310-09478] does not. The sequential training starts clockwise from ScienceQA and ends with OCR-VQA.
  • Figure 1: Examples of instruction tuning data in our proposed CoIN, which contains diverse visual understanding and perception tasks, such as classification, referring expression comprehension and image question answering.
  • Figure 2: An overview of the CoIN benchmark. A selected MLLM is sequentially fine-tuned on 8 instruction datasets spanning diverse tasks. Then, it is evaluated from two perspectives: Truth Alignment and Reasoning Capability, which assess the alignment with ground truth and the knowledge preserved for reasoning, respectively. The evaluation example at the bottom presents the results of the model tested on classification after fine-tuning on each task.
  • Figure 3: Test examples from LLaVA after training on the last task, i.e., OCR-VQA.