Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

Meng Cao, Yuyang Liu, Yingfei Liu, Tiancai Wang, Jiahua Dong, Henghui Ding, Xiangyu Zhang, Ian Reid, Xiaodan Liang

TL;DR

This work proposes COAST, a new benchmark for COntinuAl inStruction Tuning on LVLMs that covers domain-incremental, capability-incremental, and dataset-incremental configurations, together with Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs.

Abstract

Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus, an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze the LVLM and construct dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings, which encode task-specific characteristics. To obtain them, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings, which capture the inter-dependencies across tasks. Here, the low-rank embeddings chosen in the previous tasks are aggregated via a learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing forgetting during the continual instruction tuning process.
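The selection and aggregation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the flat vector shape of each pool entry, the use of a plain sum to combine the selected entries, and the fixed (rather than learned) contextual weights are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(query, keys):
    """Cosine similarity between one query vector and every row of `keys`."""
    q = query / (np.linalg.norm(query) + 1e-12)
    k = keys / (np.linalg.norm(keys, axis=1, keepdims=True) + 1e-12)
    return k @ q

def intrinsic_increment(query, keys, pool, top_m):
    """Select the `top_m` pool entries whose keys best match the query.

    Assumption: the selected entries are combined by a plain sum; the
    abstract only says that the relevant entries are selected by similarity.
    """
    sims = cosine_similarity(query, keys)
    chosen = np.argsort(-sims)[:top_m]  # indices in descending similarity
    return pool[chosen].sum(axis=0), chosen

def contextual_increment(previous, weights):
    """Weighted sum of the embeddings chosen in earlier tasks.

    `weights` would be learnable during training; here they are fixed numbers.
    """
    return np.tensordot(weights, previous, axes=1)

# Toy pool with 4 candidates, 4-dimensional keys, 3-dimensional entries.
keys = np.eye(4)
pool = np.arange(12, dtype=float).reshape(4, 3)
query = np.array([0.9, 0.1, 0.0, 0.0])

delta_theta, chosen = intrinsic_increment(query, keys, pool, top_m=2)

# Pretend two earlier tasks each contributed one selected embedding.
previous = np.stack([np.array([1.0, 2.0, 3.0]), delta_theta])
delta_delta = contextual_increment(previous, np.array([0.5, 0.5]))
```

In this toy setup the query aligns most closely with the first two keys, so those two pool entries form the intrinsic increment, and the contextual increment is an equal-weight blend of the two stored embeddings.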

Paper Structure

This paper contains 17 sections, 9 equations, 7 figures, 20 tables, 2 algorithms.

Figures (7)

  • Figure 1: COAST benchmark for continual instruction tuning including (a) domain-incremental, (b) capability-incremental, and (c) dataset-incremental learning settings.
  • Figure 2: (a) An overview of Continual LLaVA. The $i$-th input image of the $t$-th task, ${\bm{v}}_t^i$, is processed by the pre-trained visual encoder followed by a linear projection layer. The corresponding textual instruction ${\bm{s}}^i_t$ is embedded as ${\bm{q}}^i_t$ by a frozen surrogate function. The low-rank pool contains $N$ learnable proxy-increment embedding pairs $\{{\bm{k}}_n, {\bm{P}}_n\}_{n=1}^{N}$, from which the dual increment embeddings are selected according to their cosine similarity with ${\bm{q}}^i_t$. (b) A schematic illustration of the dual increment embeddings. We construct the intrinsic increments $\Delta\theta_t^i$ by aggregating the top-$M$ items from the low-rank pool based on their similarity to ${\bm{q}}^i_t$. The contextual increments $\Delta\delta_t^i$ are generated by integrating the selected embeddings from all previous tasks via learnable weights.
  • Figure 3: Visualizations on the referring QA and detail description tasks under the training chain dcrf, i.e., desc$\rightarrow$conv$\rightarrow$reason$\rightarrow$referring QA. Incorrect or undesired responses are marked in red, while noteworthy content is highlighted in green.
  • Figure 4: Illustration of the adaptation positions, including the query, key, value, and output linear projections. $\Delta\theta$ and $\Delta\delta$ denote the intrinsic and contextual increment embeddings, respectively.
  • Figure 5: Visualization of forgetting (%) on each task for sequential training (left) and our Continual LLaVA (right) under different task orders.
  • ...and 2 more figures
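
As a rough illustration of the adaptation positions in Figure 4, the sketch below adds two increments to a frozen linear projection. The captions do not specify how an increment embedding enters the projection, so the LoRA-style factorization into two small matrices, the additive combination with the frozen weight, and all names (`low_rank_delta`, `adapted_projection`) are assumptions rather than the paper's formulation.

```python
import numpy as np

def low_rank_delta(a, b):
    """Build a full-size weight increment from two low-rank factors.

    Assumption: a has shape (rank, d_in) and b has shape (d_out, rank),
    so b @ a has the same shape as the frozen weight.
    """
    return b @ a

def adapted_projection(x, w_frozen, delta_theta, delta_delta):
    """Apply a frozen projection plus intrinsic and contextual increments.

    The frozen weight is read but never modified; only the increments
    would receive gradients in a training setup.
    """
    return x @ (w_frozen + delta_theta + delta_delta).T

# Toy sizes: 4 input features, 3 output features, rank-1 increment.
w_frozen = np.zeros((3, 4))
delta_theta = low_rank_delta(np.ones((1, 4)), np.ones((3, 1)))
delta_delta = np.zeros((3, 4))  # e.g. the first task, with no earlier tasks

x = np.ones((2, 4))
y = adapted_projection(x, w_frozen, delta_theta, delta_delta)
```

Under these assumptions the same function could be applied independently at the query, key, value, and output projections, each with its own pair of increments.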