DATA: Decomposed Attention-based Task Adaptation for Rehearsal-Free Continual Learning
Huanxuan Liao, Shizhu He, Yupu Hao, Jun Zhao, Kang Liu
TL;DR
This work tackles the problem of balancing plasticity against catastrophic forgetting in rehearsal-free continual learning for large language models. It introduces DATA, which decouples task-specific and task-shared knowledge into high-rank and low-rank adapters and fuses them via a decomposed attention-based weighting mechanism, complemented by component expansion, orthogonality constraints, and stochastic restoration. Across the Standard CL, Long Sequence, and TRACE benchmarks on multiple models, DATA achieves state-of-the-art performance, significantly reducing forgetting while maintaining or improving plasticity, and it can augment existing rehearsal-free methods. The approach is parameter-efficient, privacy-preserving during training (no replay data is stored), and scalable to real-world continual adaptation scenarios, at the cost of added architectural complexity and a dependence on task and data quality.
Abstract
Continual learning (CL) is essential for Large Language Models (LLMs) to adapt to evolving real-world demands, yet they are susceptible to catastrophic forgetting (CF). While traditional CF solutions rely on expensive data rehearsal, recent rehearsal-free methods employ model-based and regularization-based strategies to address this issue. However, these approaches often neglect the model's plasticity, which is crucial to achieving optimal performance on newly learned tasks. Consequently, a key challenge in CL is striking a balance between preserving plasticity and mitigating CF. To tackle this challenge, we propose Decomposed Attention-based Task Adaptation (DATA), which explicitly decouples and learns both task-specific and task-shared knowledge using high-rank and low-rank task adapters (e.g., LoRAs). For new tasks, DATA dynamically adjusts the weights of adapters of different ranks based on their relevance to and distinction from previous tasks, allowing the model to acquire new task-specific skills while effectively retaining previously learned knowledge. Specifically, we implement a decomposed component weighting strategy comprising learnable components that collectively generate attention-based weights, allowing the model to integrate and utilize the diverse knowledge held in each adapter. Extensive experiments on three widely used benchmarks demonstrate that our proposed method achieves state-of-the-art performance. Notably, our approach significantly enhances model plasticity and mitigates CF by extending learnable components and employing stochastic restoration during training iterations.
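The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_adapter`, `attention_weights`, `fused_forward`), the use of the input vector as the query, the per-adapter keys, and the softmax over cosine similarities are all assumptions made for illustration. The sketch only shows how attention-style weights might blend a low-rank and a higher-rank LoRA-style update on top of a frozen weight matrix.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def make_adapter(rng, d_in, d_out, rank):
    # LoRA-style pair (A, B); the weight update is B @ A with shape (d_out, d_in).
    A = rng.normal(scale=0.02, size=(rank, d_in))
    B = rng.normal(scale=0.02, size=(d_out, rank))
    return A, B

def attention_weights(query, keys):
    # Hypothetical scoring: cosine similarity between the query and one
    # key per adapter, normalized with a softmax.
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return softmax(k @ q)

def fused_forward(x, W0, adapters, weights):
    # Frozen base projection plus the weighted sum of adapter updates.
    out = W0 @ x
    for (A, B), w in zip(adapters, weights):
        out = out + w * (B @ (A @ x))
    return out

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
adapters = [
    make_adapter(rng, d_in, d_out, rank=2),  # low-rank adapter
    make_adapter(rng, d_in, d_out, rank=8),  # higher-rank adapter
]
keys = rng.normal(size=(len(adapters), d_in))  # one assumed key per adapter
x = rng.normal(size=d_in)

weights = attention_weights(x, keys)
y = fused_forward(x, W0, adapters, weights)
```

In this sketch the weights sum to one, so the two adapters compete for influence on each input; whether DATA normalizes its weights this way is not stated in the abstract.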
