Lifelong Learning with Task-Specific Adaptation: Addressing the Stability-Plasticity Dilemma

Ruiyu Wang, Sen Wang, Xinxin Zuo, Qiang Sun

TL;DR

AdaLL introduces a simple, universal adapter-based framework for lifelong learning that co-trains a backbone with task-specific adapters under regularization, separating invariant feature learning from task-specific adaptation. By regularizing the backbone and using adapter bottlenecks, AdaLL tackles the stability-plasticity dilemma without freezing the backbone, so the model can learn incrementally across multiple tasks with improved retention and adaptation. The approach integrates with existing incremental learning (IL) methods (e.g., EWC, LwF, DualPrompt) and demonstrates consistent gains on CIFAR-100 and ImageNet-subset across diverse task orders and architectures, while remaining memory-efficient relative to gradient-subspace methods. These findings suggest that AdaLL is a practical, architecture-agnostic option for continual learning: it combines standard regularization techniques with adapters to improve both stability and plasticity.

Abstract

Lifelong learning (LL) aims to continuously acquire new knowledge while retaining previously learned knowledge. A central challenge in LL is the stability-plasticity dilemma, which requires models to balance the preservation of previous knowledge (stability) with the ability to learn new tasks (plasticity). While parameter-efficient fine-tuning (PEFT) has been widely adopted in large language models, its application to lifelong learning remains underexplored. To bridge this gap, this paper proposes AdaLL, an adapter-based framework designed to address the dilemma through a simple, universal, and effective strategy. AdaLL co-trains the backbone network and adapters under regularization constraints, enabling the backbone to capture task-invariant features while allowing the adapters to specialize in task-specific information. Unlike methods that freeze the backbone network, AdaLL incrementally enhances the backbone's capabilities across tasks while minimizing interference through backbone regularization. This architectural design significantly improves both stability and plasticity, effectively eliminating the stability-plasticity dilemma. Extensive experiments demonstrate that AdaLL consistently outperforms existing methods across various configurations, including dataset choices, task sequences, and task scales.
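The co-training described in the abstract can be read as a single objective per task. The following is a hedged sketch, not the paper's own formulation: the symbols are assumptions introduced here, with \theta for the backbone parameters, \phi_t for the adapter parameters of task t, \mathcal{D}_t for the data of task t, \theta^{(t-1)} for the backbone after the previous task, and \lambda for the regularization strength. The second expression instantiates the backbone regularizer \Omega with the standard EWC quadratic penalty, where F_i is the diagonal Fisher information estimate for parameter i. Leaving the adapters unregularized is also an assumption.

```latex
\min_{\theta,\;\phi_t}\;
  \mathcal{L}_{\text{task}}\bigl(f_{\theta,\phi_t};\,\mathcal{D}_t\bigr)
  \;+\; \lambda\,\Omega\bigl(\theta;\,\theta^{(t-1)}\bigr),
\qquad
\Omega_{\text{EWC}}\bigl(\theta;\,\theta^{(t-1)}\bigr)
  = \tfrac{1}{2}\sum_i F_i\,\bigl(\theta_i - \theta^{(t-1)}_i\bigr)^2
```

Under this reading, the penalty discourages the backbone from drifting away from what earlier tasks needed (stability), while the adapter parameters \phi_t remain free to fit the new task (plasticity).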

Paper Structure

This paper contains 28 sections, 8 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Existing methods contribute to incremental learning (IL) in various ways: (a) regularization-based methods, such as LwF and EWC, introduce regularization constraints to preserve knowledge from previous tasks; (b) submodule-based approaches, such as InfLoRA and SideTuning, integrate additional components such as MLPs and LoRA into the backbone network to improve adaptability while freezing the backbone to maintain stability; and (c) prompt-tuning methods (DualPrompt, CodaPrompt, etc.) introduce task-specific prefixes to the keys and values in attention modules to improve task-specific performance. Our framework (d) uses submodules in a way that lets it benefit from regularization and other backbone-specific algorithms, giving a better response to the stability-plasticity dilemma.
  • Figure 2: Architecture of the adapter and a comparison of how it is used in traditional fine-tuning and in our method. Left: an adapter consists of a down-projection, a nonlinear transformation, an up-projection, and a skip connection. Right: the key difference between the traditional use of adapters and ours is that we co-train the adapter with the entire network when learning a new task.
  • Figure 3: 10-seed average accuracy for methods with and without adapters on different orderings of CIFAR-100 in Task-IL. From left to right: alphabetical order, iCaRL order, and coarse order. The x-axis represents the number of tasks and the y-axis represents the TOP-1 accuracy (%). Solid lines show the results with adapters; dashed lines show the results without adapters.
  • Figure 4: 10-seed average accuracy for EWC and LwF when learning 5, 10, and 20 classes at a time on CIFAR-100 (alphabetical) in Task-IL. TOP-1 accuracy is reported; solid and dashed lines represent the results with and without adapters, respectively.
  • Figure 5: 10-seed average accuracy for regularization methods on ImageNet-subset. The x-axis represents the number of tasks and the y-axis represents the TOP-1 accuracy (%). Solid lines show the results with adapters; dashed lines show the results without adapters. The black dashed line denotes the result of joint training.
  • ...and 7 more figures
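
The adapter structure named in the Figure 2 caption (down-projection, nonlinear transformation, up-projection, skip connection) can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the class name `Adapter`, the helper functions, the choice of ReLU as the nonlinearity, the initialization scale, and the zero-initialized up-projection are all assumptions made for the example.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(v):
    """Element-wise ReLU; the nonlinearity is an assumption of this sketch."""
    return [max(0.0, a) for a in v]


class Adapter:
    """Hypothetical bottleneck adapter: x + W_up * relu(W_down * x)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: dim -> bottleneck (small random init, assumed).
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: bottleneck -> dim, zero-initialized (assumed) so the
        # adapter starts out as the identity map.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def forward(self, x):
        hidden = relu(matvec(self.w_down, x))       # down-project + nonlinearity
        delta = matvec(self.w_up, hidden)           # up-project back to dim
        return [xi + di for xi, di in zip(x, delta)]  # skip connection


# One adapter per task, keyed by a hypothetical task id.
adapters = {task_id: Adapter(dim=4, bottleneck=2, seed=task_id)
            for task_id in range(3)}
features = [1.0, -2.0, 0.5, 3.0]
adapted = adapters[0].forward(features)
```

With the zero-initialized up-projection, a freshly created adapter leaves the backbone features unchanged, so adding it does not perturb the network before training on the new task begins. In the co-training setting the caption describes, both the adapter weights and the backbone weights would then receive gradient updates; that training loop is omitted here.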