DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning

Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu

TL;DR

This work tackles the problem of balancing plasticity and catastrophic forgetting in rehearsal-free continual learning for large language models. It introduces DATA, which decomposes knowledge into high-rank and low-rank adapters and fuses them via a decomposed attention-based weighting mechanism, complemented by expansion, orthogonality constraints, and stochastic restoration. Across Standard CL, Long Sequence, and TRACE benchmarks on multiple models, DATA achieves state-of-the-art performance, significantly reducing forgetting while maintaining or improving plasticity, and can augment existing rehearsal-free methods. The approach is parameter-efficient, privacy-preserving during training (no replay data stored), and scalable to real-world continual adaptation scenarios, albeit with added architectural complexity and dependence on task/data quality.

Abstract

Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose the $\textbf{D}$ecomposed $\textbf{A}$ttention-based $\textbf{T}$ask $\textbf{A}$daptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize diverse knowledge from each DATA. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.
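
As an illustration of the idea of pairing adapters of different ranks on one linear layer, the following is a minimal NumPy sketch, not the authors' implementation. The dimensions, the ranks `r_low` and `r_high`, the helper names, and the use of two scalar fusion weights are all assumptions made for the example; in DATA the weights are produced by the attention-based weighting mechanism rather than passed in by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: one square linear layer with two adapter ranks.
d_in, d_out = 16, 16
r_low, r_high = 2, 8

# Frozen pre-trained weight of the linear layer.
W0 = rng.normal(size=(d_out, d_in))


def init_lora(rank):
    """Standard LoRA-style init: A is small random, B is zero,
    so the adapter's update B @ A starts at exactly zero."""
    A = rng.normal(scale=0.01, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B


def fused_forward(x, W0, low, high, w_low, w_high):
    """Frozen layer output plus a weighted sum of the two adapter updates.

    x has shape (batch, d_in); w_low and w_high are scalar fusion weights
    (assumed here; a learned weighting module would supply them).
    """
    A_l, B_l = low
    A_h, B_h = high
    base = x @ W0.T
    delta_low = (x @ A_l.T) @ B_l.T    # update of rank at most r_low
    delta_high = (x @ A_h.T) @ B_h.T   # update of rank at most r_high
    return base + w_low * delta_low + w_high * delta_high
```

Because both branches are linear, the weighted updates can be folded into a single matrix `W0 + w_low * B_l @ A_l + w_high * B_h @ A_h` once the weights are fixed, which is one way such a design can avoid extra inference cost.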

Paper Structure

This paper contains 27 sections, 11 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: (a) Comparisons of previous CL methods with DATA. Rehearsal-free indicates that the methods do not require storing sample features from previous tasks. Inference Efficiency denotes the computational efficiency during the inference phase. Plasticity is the ability to adapt effectively to new tasks. (b) Comparison of AP and Forget (Sec. \ref{sec:dataset}) across different CL methods.
  • Figure 2: (a) A t-SNE distribution analysis of different adapter representations on Order 1 (4 tasks). The low-rank branch shows a consistent distribution across the target tasks, whereas the high-rank branch exhibits substantial distribution differences across them. (b) Divergence of the different branches on Order 4 (15 tasks). Compared with the source model, the low-rank adapters effectively alleviate inter-task divergence across all 14 task transitions, while the high-rank adapters significantly enhance intra-task feature aggregation.
  • Figure 3: Overview of the DATA framework. (a) We introduce a novel decomposed component weighting strategy for generating attention-based weights, parameterized by a set of extended weight components, each associated with a corresponding key and attention vector. Only the weighting and adapter parameters are trainable, which is parameter-efficient, and no training data is stored for replay, which is memory-efficient and privacy-preserving. (b) We integrate low-rank and high-rank DATA into the linear layers of the pre-trained model, guided by the generated weights, allowing for the dynamic fusion of knowledge from each DATA with different task representations.
  • Figure 4: The shifts in CL methods with FP and Forget on LS Order 4. DATA prevents the shift (blue bar) and thus mitigates forgetting (orange line).
  • Figure 5: Performance of different LoRA ranks.
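
The Figure 3 caption describes weight components that each carry a key and an attention vector. One plausible reading is sketched below in NumPy: the query is re-weighted feature-wise by each attention vector and compared with the matching key by cosine similarity, and the resulting scores mix the components. The cosine-similarity form, the function names, and the `expand` helper are assumptions for illustration only; the paper's exact formulation may differ.

```python
import numpy as np


def component_weights(query, keys, attn):
    """Score each component against a query.

    query: (d,)    feature vector for the current input (assumed given)
    keys:  (M, d)  one key per weight component
    attn:  (M, d)  one attention vector per weight component
    Returns an (M,) array of cosine similarities in [-1, 1].
    """
    attended = query[None, :] * attn  # feature-wise re-weighting of the query
    num = np.sum(attended * keys, axis=1)
    den = np.linalg.norm(attended, axis=1) * np.linalg.norm(keys, axis=1) + 1e-8
    return num / den


def fuse_components(weights, components):
    """Weighted sum over the first axis of `components`, shape (M, ...)."""
    return np.tensordot(weights, components, axes=1)


def expand(old, n_new, rng):
    """Append n_new freshly initialised rows for a new task.

    In training, the old rows would be kept frozen; this helper only
    illustrates how the component set grows.
    """
    new = rng.normal(scale=0.01, size=(n_new,) + old.shape[1:])
    return np.concatenate([old, new], axis=0)
```

A short usage example under the same assumptions:

```python
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
attn = np.ones((4, 8))
comps = rng.normal(size=(4, 3, 3))
alpha = component_weights(keys[0], keys, attn)  # component 0 matches its own key
mixed = fuse_components(alpha, comps)           # shape (3, 3)
```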