Beyond Prompt Learning: Continual Adapter for Efficient Rehearsal-Free Continual Learning

Xinyuan Gao, Songlin Dong, Yuhang He, Qiang Wang, Yihong Gong

TL;DR

Rehearsal-free continual learning (RFCL) remains challenging because of domain shift and the key-query matching errors that affect prompt-based methods. This paper presents Continual Adapter (C-ADA), which attaches a parameter-extensible continual adapter layer (CAL) and a lightweight scaling and shifting (S&S) module in parallel with a frozen pre-trained backbone, and adds an orthogonal loss to reduce interference between old and new knowledge. The method reports SOTA results on class- and domain-incremental benchmarks and trains significantly faster, because it needs only a single forward pass through the backbone and tunes fewer parameters. It is robust across settings and preserves privacy by not storing old data. The approach offers a practical, scalable path for continual learning with large pre-trained models.

Abstract

Rehearsal-Free Continual Learning (RFCL) aims to continually learn new knowledge while preventing forgetting of old knowledge, without storing any old samples or prototypes. The latest methods leverage large-scale pre-trained models as the backbone and use key-query matching to generate trainable prompts to learn new knowledge. However, the domain gap between the pre-training dataset and the downstream datasets can easily lead to inaccurate prompt selection in key-query matching when queries are generated directly by the pre-trained model, which hampers learning new knowledge. Thus, in this paper, we propose an approach to the RFCL task that goes beyond prompt learning, called Continual Adapter (C-ADA). It mainly comprises a parameter-extensible continual adapter layer (CAL) and a scaling and shifting (S&S) module in parallel with the pre-trained model. C-ADA flexibly extends specific weights in CAL to learn new knowledge for each task and freezes old weights to preserve prior knowledge, thereby avoiding the matching errors and operational inefficiencies introduced by key-query matching. To reduce the domain gap, C-ADA employs an S&S module to transfer the feature space from pre-trained datasets to downstream datasets. Moreover, we propose an orthogonal loss to mitigate the interaction between old and new knowledge. Our approach achieves significantly improved performance and training speed, outperforming the current state-of-the-art (SOTA) method. Additionally, we conduct experiments on domain-incremental learning, surpassing the SOTA and demonstrating the generality of our approach in different settings.
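
The two auxiliary components named in the abstract can be illustrated with a small pure-Python sketch. It assumes the S&S module is a per-channel affine map (a scale vector `gamma` and a shift vector `beta`) and that the orthogonal loss penalizes squared dot products between frozen old weight vectors and the new task's weight vectors. The abstract states neither formula, so both forms are assumptions, and the function names and toy values are hypothetical.

```python
# Illustrative sketch only; the formulas are assumptions, not taken from the paper.

def scale_and_shift(features, gamma, beta):
    """Per-channel affine map: out[i] = gamma[i] * features[i] + beta[i]."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_penalty(old_weights, new_weights):
    """Sum of squared dot products over every (old, new) weight-vector pair.

    The value is zero exactly when each new vector is orthogonal to each
    frozen old vector, so minimizing it discourages new weights from
    reusing directions already occupied by earlier tasks.
    """
    return sum(dot(o, n) ** 2 for o in old_weights for n in new_weights)


# Hypothetical toy values.
old = [[1.0, 0.0, 0.0]]               # frozen weights from an earlier task
new_orthogonal = [[0.0, 2.0, 0.0]]    # no overlap with the old direction
new_overlapping = [[3.0, 1.0, 0.0]]   # overlaps with the old direction

shifted = scale_and_shift([1.0, 2.0, 3.0],
                          gamma=[2.0, 1.0, 0.5],
                          beta=[0.0, -1.0, 1.0])
```

In this toy setting the penalty vanishes for `new_orthogonal` and is positive for `new_overlapping`, which is the behaviour one would want from a term meant to limit interaction between old and new knowledge.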

Paper Structure (32 sections, 7 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Prior prompt-based approaches pass the input through the pre-trained model to generate a query, use key-query matching to select prompts, and then insert those prompts into the pre-trained model for a second pass (each layer has a unique prompt). Our Continual Adapter (C-ADA) approach eliminates the need for key-query matching by introducing the CAL and S&S, which significantly improves learning of new knowledge. Moreover, C-ADA passes through the pre-trained model only once, which speeds up training.
  • Figure 2: The framework of our C-ADA approach. For simplicity, we omit the skip connections and LayerNorm in the figure. "or loss" denotes the orthogonal loss. We attach our S&S and CAL, which run in parallel with the projection layer and MLP, to the first (shallowest) N layers of the ViT and freeze the pre-trained backbone. For each new task, we add two trainable weights to the CAL to learn the new knowledge and freeze the previous weights. Unlike prior work, our approach uses a novel adapter variant that removes the need for key-query matching. Only the trainable weights and classifier parameters are optimized, which is parameter-efficient, and no old information (old images or prototypes) is stored, which preserves privacy.
  • Figure 3: Average accuracy $A_N$ (%) vs. tuned parameters (%). We report the results on 20-task ImageNet-R.
  • Figure 4: Ablation results (%) on 10-task ImageNet-R and CIFAR-100. $A_N$ gives the accuracy averaged over tasks. We ablate the key components in turn and report the results.
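
The expand-and-freeze behaviour described in the Figure 2 caption can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the "two trainable weights" added per task are a down-projection and an up-projection, that the adapter output is added to the frozen layer's output, and that a ReLU sits between the two projections. The class name, shapes, and initialization are hypothetical.

```python
import random

# Illustrative sketch only: shapes, initialization, and the ReLU bottleneck
# are assumptions, not details taken from the paper.

def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


class ContinualAdapterSketch:
    """Adapter whose bottleneck grows by `width` units per task (hypothetical)."""

    def __init__(self, dim, width, seed=0):
        self.dim, self.width = dim, width
        self.rng = random.Random(seed)
        self.down = []       # per task: `width` rows of length `dim`
        self.up = []         # per task: `dim` rows of length `width`
        self.trainable = []  # per task flag; False means frozen

    def add_task(self):
        """Freeze all existing blocks, then append one trainable down/up pair."""
        self.trainable = [False] * len(self.trainable)
        self.down.append([[self.rng.gauss(0.0, 0.02) for _ in range(self.dim)]
                          for _ in range(self.width)])
        # Zero-initialized up-projection: a new block starts as a no-op.
        self.up.append([[0.0] * self.width for _ in range(self.dim)])
        self.trainable.append(True)

    def forward(self, x, frozen_out):
        """Add every task block's output to the frozen layer's output."""
        out = list(frozen_out)
        for down, up in zip(self.down, self.up):
            hidden = [max(0.0, h) for h in matvec(down, x)]  # ReLU
            out = [o + d for o, d in zip(out, matvec(up, hidden))]
        return out


# Usage: two tasks arrive in sequence; only the newest block stays trainable.
adapter = ContinualAdapterSketch(dim=4, width=2)
adapter.add_task()
adapter.add_task()
x = [1.0, -0.5, 0.25, 2.0]
y = adapter.forward(x, frozen_out=x)  # identity stands in for the frozen layer
```

Because no key or query is computed, the forward pass uses every task block directly, which is consistent with the single-pass property the captions emphasize. An optimizer in a real training loop would update only the blocks flagged in `trainable`.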