A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

Haihua Luo, Xuming Ran, Jiangrong Shen, Timo Hämäläinen, Zhonghua Chen, Qi Xu, Fengyu Cong

Abstract

Incremental Learning (IL) aims to learn new tasks while preserving previously acquired knowledge. Integrating the zero-shot learning capabilities of pre-trained vision-language models into IL methods has marked a significant advancement. However, these methods face three primary challenges: (1) the need for improved training efficiency; (2) reliance on a memory bank to store previous data; and (3) the necessity of a strong backbone to augment the model's capabilities. In this paper, we propose SimE, a Simple and Efficient framework that employs a vision-language model with adapters designed specifically for the IL task. We report a remarkable phenomenon: there is a nonlinear correlation between the number of adaptive adapter connections and the model's IL capabilities. While increasing adapter connections between transformer blocks improves model performance, adding more adaptive connections within transformer blocks during smaller incremental steps does not enhance, and may even degrade, the model's IL ability. Extensive experimental results show that SimE surpasses traditional methods by 9.6% on TinyImageNet and outperforms other CLIP-based methods by 5.3% on CIFAR-100. Furthermore, we conduct a systematic study to enhance the utilization of the zero-shot capabilities of CLIP. We suggest replacing SimE's encoder with a CLIP model trained on larger datasets (e.g., LAION2B) and stronger architectures (e.g., ViT-L/14).

Paper Structure (18 sections, 2 theorems, 25 equations, 8 figures, 12 tables)

Key Result

Theorem 1

If then

Figures (8)

  • Figure 1: The overall framework of SimE. Green denotes trainable components and grey denotes frozen ones. A) illustrates the incremental learning setting, which comprises $t$ tasks. Specifically, we finetune the trainable parameters in SimE for task 1 and freeze all parameters in SimE for the remaining tasks. B) The learning process for Task 1 is divided into three stages: in the Adapter stage, the image encoder is finetuned using adapters; in the Prototype 1 stage, prototypes are computed with the finetuned image encoder and the classifier is updated; in the Test 1 stage, the classification performance of the model is evaluated. C) For each subsequent task $i$ ($1 < i < t$), all weights are frozen; only the prototypes are computed and the classifier is updated. D) depicts the architectures of various image encoders.
  • Figure 2: Comparison of previous and current finetuning approaches. The previous approach, AdaptFormer (A), is contrasted with our Multi-Adapter finetuning (B, C, and D). Modules colored green are trainable, while those in gray are frozen. In AdaptFormer and Multi-Adapter, the AdaptMLP, AdaptAtten, and AdaptAll modules are parameterized by a bottom-up bottleneck module with trainable parameters, whereas the original MLP and Self-Attention modules remain frozen. AdaptFormer consists of the original frozen branch coupled with AdaptMLP. In contrast, our Multi-Adapter incorporates various trainable modules alongside the frozen branch for enhanced adaptability. $B \times$ denotes $B$ blocks.
  • Figure 3: Comparison of the efficiency of different CIL methods. The dotted line and the orange right-hand axis show the Last accuracy or Avg accuracy. (a), (b), and (c) show the training parameters, GPU usage, and memory bank size of different CIL methods, respectively, each alongside Last accuracy. (d) and (e) show the training parameters and Avg accuracy of Ours under different bottleneck dimensions and numbers of adapters. (f) compares Ours with other CIL methods in training parameters and Avg accuracy. All experiments are conducted on CIFAR100, and (a)-(e) use 10 steps.
  • Figure 4: Influence of adapters' position and number between transformer blocks. The x-axis represents the number of adapters in the encoder, with the numerical ranges indicating the positions of the adapters. For example, "1-3" signifies that adapters are inserted in the first 3 blocks. The accuracy shown in (c) and (d) represents the average results for different adapter positions with the same number of adapters. All results are based on CIFAR100.
  • Figure 5: t-SNE visualization of CLIP pre-trained on different datasets. All results are obtained on 10 classes of CIFAR100, with ViT-B/16 as the backbone.
  • ...and 3 more figures
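
The Figure 1 caption describes a procedure in which, after task 1, the encoder stays frozen and each later task only computes prototypes and updates the classifier. The sketch below illustrates that idea in a minimal form. It is not the authors' implementation: the class name `PrototypeClassifier`, the use of class-mean features as prototypes, and cosine-similarity prediction are all assumptions made for illustration, and the feature vectors stand in for outputs of a frozen image encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


class PrototypeClassifier:
    """Nearest-prototype classifier over frozen-encoder features.

    Hypothetical helper: prototypes are assumed to be class-mean features,
    and prediction is assumed to use cosine similarity.
    """

    def __init__(self):
        self.prototypes = {}  # class id -> unit-norm mean feature

    def add_task(self, features, labels):
        """Add prototypes for the classes of a new task.

        Prototypes of earlier classes are left untouched, and no
        samples from earlier tasks are needed.
        """
        feats = l2_normalize(np.asarray(features, dtype=np.float64))
        labels = np.asarray(labels)
        for c in np.unique(labels):
            mean = feats[labels == c].mean(axis=0)
            self.prototypes[int(c)] = l2_normalize(mean)

    def predict(self, features):
        """Return the class whose prototype is most similar to each feature."""
        classes = sorted(self.prototypes)
        protos = np.stack([self.prototypes[c] for c in classes])
        feats = l2_normalize(np.asarray(features, dtype=np.float64))
        sims = feats @ protos.T  # cosine similarity, since both are unit-norm
        return np.asarray(classes)[sims.argmax(axis=1)]


# Toy usage with made-up 3-D "features" standing in for encoder outputs.
clf = PrototypeClassifier()
clf.add_task([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1],
              [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]], [0, 0, 1, 1])  # task 1
clf.add_task([[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]], [2, 2])        # task 2
print(clf.predict([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
```

Because adding a task only writes new entries into the prototype table, a classifier of this kind needs no gradient updates and no stored samples from earlier tasks when a new task arrives, which is consistent with the caption's description of the subsequent-task stage.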

Theorems & Definitions (2)

  • Theorem 1: Monotonic Bound on Performance
  • Theorem 2: Non-monotonic Actual Solutions