Self-Expansion of Pre-trained Models with Mixture of Adapters for Continual Learning

Huiyi Wang, Haodong Lu, Lina Yao, Dong Gong

TL;DR

The paper addresses continual learning on non-stationary data streams with a frozen pre-trained transformer backbone. It introduces Self-Expansion of pre-trained models with Modularized Adaptation (SEMA), which decides on demand whether to reuse existing modular adapters or add new ones, guided by representation descriptors and a learnable, expandable weighting router. The backbone stays frozen; adapters are added at individual layers only when a distribution shift is detected there, and a router is trained to mix the adapter outputs. This yields sub-linear model growth and requires no memory rehearsal. Experiments on ViT-based CL benchmarks (CIFAR-100, ImageNet-R, ImageNet-A, VTAB) report state-of-the-art performance among rehearsal-free PTM-based CL methods and effective knowledge reuse, and adapters and descriptors can be trained in parallel for efficiency. Overall, SEMA provides an on-demand adaptation framework that improves control of the stability-plasticity balance in continual learning with large pre-trained models.

Abstract

Continual learning (CL) aims to continually accumulate knowledge from a non-stationary data stream without catastrophic forgetting of learned knowledge, requiring a balance between stability and adaptability. Relying on the generalizable representation in pre-trained models (PTMs), PTM-based CL methods perform effective continual adaptation on downstream tasks by adding learnable adapters or prompts upon the frozen PTMs. However, many existing PTM-based CL methods use restricted adaptation on a fixed set of these modules to avoid forgetting, suffering from limited CL ability. Periodically adding task-specific modules results in a linear model growth rate and impaired knowledge reuse. We propose Self-Expansion of pre-trained models with Modularized Adaptation (SEMA), a novel approach to enhance the control of stability-plasticity balance in PTM-based CL. SEMA automatically decides to reuse or add adapter modules on demand in CL, depending on whether a significant distribution shift that cannot be handled is detected at different representation levels. We design a modular adapter consisting of a functional adapter and a representation descriptor. The representation descriptors are trained as distribution shift indicators and used to trigger self-expansion signals. To better compose the adapters, an expandable weighting router is learned jointly to mix the adapter outputs. SEMA enables better knowledge reuse and a sub-linear expansion rate. Extensive experiments demonstrate the effectiveness of the proposed self-expansion method, achieving state-of-the-art performance compared to PTM-based CL methods without memory rehearsal. Code is available at https://github.com/huiyiwang01/SEMA-CL.
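
The mixture of adapter outputs described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names (`Adapter`, `ExpandableLayer`), the bottleneck adapter shape, the softmax router, and the random initialization are all assumptions made for illustration, and training and freezing of earlier modules are omitted. It only shows how a router can grow by one row each time an adapter is added and how its weights mix the adapter outputs.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Adapter:
    """Hypothetical bottleneck adapter: down-projection, ReLU, up-projection."""

    def __init__(self, dim, bottleneck, rng):
        self.down = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(bottleneck)]
        self.up = [[rng.gauss(0.0, 0.1) for _ in range(bottleneck)] for _ in range(dim)]

    def __call__(self, x):
        hidden = [max(0.0, dot(row, x)) for row in self.down]
        return [dot(row, hidden) for row in self.up]


class ExpandableLayer:
    """Holds the adapters of one layer plus a router with one weight row per adapter."""

    def __init__(self, dim, bottleneck=4, seed=0):
        self.dim = dim
        self.bottleneck = bottleneck
        self.rng = random.Random(seed)
        self.adapters = []
        self.router = []

    def expand(self):
        # Adding an adapter also appends one row to the router.
        self.adapters.append(Adapter(self.dim, self.bottleneck, self.rng))
        self.router.append([self.rng.gauss(0.0, 0.1) for _ in range(self.dim)])

    def mixing_weights(self, x):
        # One logit per adapter, normalized so the weights sum to 1.
        return softmax([dot(row, x) for row in self.router])

    def forward(self, x):
        # Weighted sum of adapter outputs (assumes at least one adapter exists).
        weights = self.mixing_weights(x)
        outputs = [adapter(x) for adapter in self.adapters]
        return [sum(w * out[i] for w, out in zip(weights, outputs)) for i in range(self.dim)]


layer = ExpandableLayer(dim=8)
layer.expand()                      # first adapter
x = [0.5] * 8
single = layer.mixing_weights(x)    # with one adapter the router has nothing to choose between
layer.expand()                      # second adapter, router grows by one row
mixed = layer.forward(x)
```

With a single adapter the softmax returns a weight of 1, so the router only starts to matter once a second adapter has been added. How the mixed output is combined with the frozen backbone is left out of this sketch.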

Paper Structure (32 sections, 6 equations, 13 figures, 15 tables)

Figures (13)

  • Figure 1: An example of the self-expansion process. (a) The PTM (i.e., ViT) with $L$ transformer layers at the initial point of CL. (b) First-session adaptation: at Task 1, a modular adapter and a (dummy) router are added and trained in each transformer layer. (c) The modular adapters and routers added in the previous step (Task 1) are frozen to alleviate forgetting. When Task 2 arrives, only the representation descriptor in the $L$-th layer detects a feature distribution shift (with novel patterns) and generates an expansion signal. A new module is added and trained in the $L$-th layer, with the router expanded and updated. (d) At Task 3, a new adapter is added at the $(L-1)$-th layer, where the expansion signal is first generated. In this example, expansion is then triggered again in the $L$-th layer, following the expansion in the $(L-1)$-th layer. If a task does not trigger an expansion signal in any layer (implying no significantly different pattern), no expansion happens and the existing adapters are reused. More discussion is in Appendix \ref{supp:training_procedure}.
  • Figure 2: Overview of the model architecture. (a) shows the structure of the expandable adapter modules with adapters, representation descriptors (RDs), and a router. (b) shows the scenario where expansion is triggered by representations whose distribution, as estimated by the RDs, differs from that of previous tasks. RDs are trained to align with the feature distribution of the corresponding task via $\mathcal{L}_\text{RD}$ only, unaffected by gradients from the classification loss. (c) shows the scenario where the incoming distribution can be handled by previously added modules, resulting in no expansion and adapter reuse.
  • Figure 3: Incremental performance of different methods on class-incremental learning benchmarks.
  • Figure 4: Reconstruction error during training, illustrating the dynamic expansion process. Expansion occurs for Tasks 1, 2, and 3, while no expansion is triggered for Tasks 4 and 5 because no distribution shift is detected.
  • Figure 5: Visualization of adapter usage on VTAB. Adapters 1, 2, and 3 are added and trained on Tasks 1, 2, and 3, respectively. Tasks 4 and 5 primarily reuse Adapters 1 and 3 because their feature distributions are similar to those of Tasks 1 and 3.
  • ...and 8 more figures
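
The expansion decision summarized in the captions of Figures 2 and 4 can be sketched in a few lines. The captions only state that representation descriptors indicate distribution shift and that reconstruction error is tracked; the z-score rule, the threshold of 3.0, and the names `DescriptorStats` and `should_expand` below are assumptions for illustration, not the paper's exact criterion. The sketch flags a shift at a layer only when every existing descriptor finds the incoming features unusual, and otherwise reuses the existing adapters.

```python
import statistics


class DescriptorStats:
    """Hypothetical summary of one descriptor's reconstruction errors on its own task."""

    def __init__(self, train_errors):
        self.mean = statistics.fmean(train_errors)
        self.std = statistics.pstdev(train_errors)

    def z_score(self, error):
        # Guard against a zero standard deviation.
        return (error - self.mean) / max(self.std, 1e-8)


def should_expand(descriptors, new_errors, threshold=3.0):
    """Return True if no existing descriptor can account for the incoming data.

    new_errors[k] holds the reconstruction errors of the incoming samples
    under descriptors[k]. With no descriptors yet, the result is True,
    which corresponds to adding the first adapter.
    """
    for desc, errors in zip(descriptors, new_errors):
        mean_z = statistics.fmean([desc.z_score(e) for e in errors])
        if mean_z <= threshold:
            return False  # this descriptor handles the data, so reuse
    return True


task1 = DescriptorStats([1.0, 1.1, 0.9, 1.0])
reuse = should_expand([task1], [[1.05, 0.95]])   # errors look like Task 1
expand = should_expand([task1], [[2.0, 2.2]])    # errors far above Task 1's range
```

Here `reuse` evaluates to `False` and `expand` to `True`. Requiring all descriptors to flag a shift is one simple way to favor reuse, which matches the behavior shown for Tasks 4 and 5 in Figure 4.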