Neural Networks Remember More: The Power of Parameter Isolation and Combination

Biqing Zeng, Zehan Li, Aladdin Ayesh

TL;DR

Catastrophic forgetting in continual learning for PLMs is mitigated by a two-stage approach that isolates task-specific parameters via PEFT and then combines acquired knowledge through Task Arithmetic to form a unified backbone. Each downstream task gets its own PEFT module (Adapter or LoRA) while the backbone is frozen, preventing interference; after training, per-task vectors $\tau_i = \Phi_i - \Phi_{pre}$ are merged as $\tau = \sum_{i=1}^{N} \lambda \tau_i$ and $\Theta = \Theta_{pre} + \tau$. The method also explores initialization strategies for knowledge transfer, showing that initializing new PEFT modules with prior weights accelerates convergence and improves transfer. On five continual learning benchmarks, the approach achieves state-of-the-art performance without storing historical data or relying on a task-id during testing, demonstrating strong robustness to task order and reduced storage costs.
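
The merge step in the TL;DR can be illustrated with a minimal Python sketch. It is not the authors' implementation: parameters are modelled as plain dictionaries of float lists, the tuned module weights and the backbone are assumed to live in one shared parameter space, and the names `task_vector`, `merge_task_vectors`, `apply_to_backbone` and the value `lam = 0.5` are hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def task_vector(tuned: Params, pretrained: Params) -> Params:
    """Per-task vector tau_i: tuned weights minus pre-trained weights."""
    return {
        name: [t - p for t, p in zip(tuned[name], pretrained[name])]
        for name in pretrained
    }


def merge_task_vectors(task_vectors: List[Params], lam: float) -> Params:
    """Merged vector tau: sum over tasks of lam * tau_i, with one shared lam."""
    if not task_vectors:
        raise ValueError("at least one task vector is required")
    merged = {name: [0.0] * len(vals) for name, vals in task_vectors[0].items()}
    for tv in task_vectors:
        for name, vals in tv.items():
            merged[name] = [m + lam * v for m, v in zip(merged[name], vals)]
    return merged


def apply_to_backbone(pretrained: Params, merged: Params) -> Params:
    """Unified weights: pre-trained weights plus the merged task vector."""
    return {
        name: [p + d for p, d in zip(pretrained[name], merged[name])]
        for name in pretrained
    }


# Toy example with two tasks and one two-element parameter (made-up values).
pretrained = {"w": [0.0, 0.0]}
tuned_a = {"w": [1.0, 0.0]}  # weights after training on task A
tuned_b = {"w": [0.0, 2.0]}  # weights after training on task B
lam = 0.5                    # assumed scaling coefficient

vectors = [task_vector(tuned_a, pretrained), task_vector(tuned_b, pretrained)]
merged = merge_task_vectors(vectors, lam)
unified = apply_to_backbone(pretrained, merged)
```

Because each task vector is computed against the same pre-trained starting point, the merge needs only the per-task weights, not historical training data or a task-id at test time.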

Abstract

Catastrophic forgetting is a pervasive issue for pre-trained language models (PLMs) during continual learning, where models lose previously acquired knowledge when sequentially trained on a series of tasks. The model's ability to retain old tasks is referred to as stability, while its adaptability to new tasks is called plasticity. The key to solving this problem is therefore to find a trade-off between the plasticity and stability of the model. In this paper, we propose a novel method that balances model stability and plasticity, thereby mitigating catastrophic forgetting. More specifically, our approach leverages parameter isolation followed by a combination strategy. In the training stage, the model adapts to each downstream task via a parameter isolation method to prevent potential interference among different tasks. We then combine all trained parameters, which contain the acquired knowledge, using the task arithmetic method and finally apply them to the backbone model. Empirical evaluations on continual language learning benchmarks substantiate the effectiveness of our approach, showing a marked improvement over existing state-of-the-art approaches.

Paper Structure

This paper contains 18 sections, 6 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The process of our method. Our approach can be divided into two stages. In stage one, we train the model with a PEFT method and initialize the next module with the tuned weights. In stage two, which is also the testing phase, we combine all adapted PEFT modules using the Task Arithmetic method and then apply them to the backbone model.
  • Figure 2: Average results of the Adapter method with varying bottleneck dimensions in the full setting, across five different datasets (ag news $\rightarrow$ yelp $\rightarrow$ amazon $\rightarrow$ yahoo $\rightarrow$ db).
  • Figure 3: Mean results of the LoRA method with different LoRA ranks in the full setting, across five different datasets (ag news $\rightarrow$ yelp $\rightarrow$ amazon $\rightarrow$ yahoo $\rightarrow$ db).