SAPT: A Shared Attention Framework for Parameter-Efficient Continual Learning of Large Language Models

Weixiang Zhao; Shilong Wang; Yulin Hu; Yanyan Zhao; Bing Qin; Xuanyu Zhang; Qing Yang; Dongliang Xu; Wanxiang Che

TL;DR

SAPT tackles continual learning for large language models by aligning the learning of parameter-efficient tuning (PET) blocks with their selection through a Shared Attentive Learning & Selection (SALS) module. It also introduces an Attentive Reflection Module (ARM), which uses generated pseudo samples to recall the attention weights of past tasks, so earlier tasks remain handled correctly without task IDs at test time. On the SuperNI and Long Sequence benchmarks, SAPT consistently improves resistance to catastrophic forgetting (CF) and knowledge transfer (KT) across model sizes and architectures, outperforming state-of-the-art PET-based baselines. The results indicate that the framework scales across LLM backbones and suits real-world settings where tasks arrive dynamically.

Abstract

The continual learning (CL) ability is vital for deploying large language models (LLMs) in the dynamic world. Existing methods devise a learning module that acquires task-specific knowledge with parameter-efficient tuning (PET) blocks and a selection module that picks out the corresponding block for each testing input, aiming to handle the challenges of catastrophic forgetting and knowledge transfer in CL. However, these methods tend to address only one of the challenges, ignoring the potential of aligning the two modules to address catastrophic forgetting and knowledge transfer simultaneously. To this end, we propose a novel Shared Attention Framework (SAPT) that aligns PET learning and selection via the Shared Attentive Learning & Selection module. Extensive experiments on two CL benchmarks demonstrate the superiority of SAPT. Moreover, SAPT consistently demonstrates its superiority when scaled to different model sizes (from 770M to 13B), different model architectures (T5 and LLaMA-2), and unseen tasks.
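
The alignment of learning and selection can be illustrated with a small sketch. Following the Figure 2 caption, one attention operation between an input and the per-task PET key vectors yields weights that combine all PET blocks, and the same operation is reused at test time. Everything concrete below is an assumption made for illustration: the function names, the plain-list representation of an input embedding and of PET block parameters, and the scaled dot-product form of the attention score. It is not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(query, keys):
    """Attention weights of one input over all PET key vectors.

    Assumption: scaled dot-product scoring; the source only states that
    attention is computed between the input and the key vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def combine_pet_blocks(weights, blocks):
    """Weighted combination of PET blocks (each block is a flat parameter list)."""
    dim = len(blocks[0])
    return [sum(w * b[i] for w, b in zip(weights, blocks)) for i in range(dim)]

# Hypothetical toy values: one input embedding, three keys, three PET blocks.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
blocks = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]

# The same call serves training (attentive learning) and testing (attentive selection).
weights = shared_attention(query, keys)
mixed = combine_pet_blocks(weights, blocks)
```

Because the weights come from the input itself, no task ID is needed to choose a block at test time, which is the property the TL;DR highlights.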

Paper Structure (49 sections, 7 equations, 12 figures, 12 tables)

Introduction
Related Works
Parameter-Efficient Tuning
Continual Learning for LLMs
Conventional Continual Learning (CL)
Continual Learning for LLMs with PET
Problem Definition and Setup
Methodology
Overview of the Framework
Shared Attentive Learning & Selection Module
PET Methods
Attentive Learning
Attentive Selection
Shared Attentive Learning & Selection
Attentive Reflection Module
...and 34 more sections

Figures (12)

Figure 1: The conceptual framework for the learning and the selection module to achieve the continual learning of large language models based on PET blocks when the new Dialogue Generation task arrives. Dashed lines represent the working process of existing works while solid lines are for that of our SAPT in this work.
Figure 2: The overall architecture of our proposed SAPT. We assume that SAPT is currently at time step $3$, learning task $\mathcal{T}_3$. (1) In the SALS, as illustrated by the solid lines, the attention weight $\boldsymbol{a}_3$ of task $\mathcal{T}_3$ is first obtained via the instance-level shared attention operation between the input $x_3$ and the PET key vectors $\{\boldsymbol{k}_1, \boldsymbol{k}_2, \boldsymbol{k}_3\}$, and is used to perform a weighted combination of all PET blocks $\{B_1, B_2, B_3\}$ for the attentive learning of the current task $\mathcal{T}_3$. The dashed lines display the process of attentive selection, which follows the same shared attention process to reach the attention weight $\boldsymbol{a}_3$ and uses it to handle given inputs at testing time. (2) In the ARM, for previous tasks $\mathcal{T}_1$ and $\mathcal{T}_2$, the current attention weights ($\hat{\boldsymbol{a}}_1$ and $\hat{\boldsymbol{a}}_2$) are pulled back to their original states ($\boldsymbol{a}_1$ and $\boldsymbol{a}_2$) with the help of generated pseudo samples $\hat{x}_1$ and $\hat{x}_2$.
Figure 3: Visualization on shared attention of SAPT-Prompt on the Long Sequence benchmark during the training for each task (left) and testing for all tasks after the training of the last task (right).
Figure 4: Performance of SAPT and baseline methods on T5 models of different sizes, in terms of continual learning performance, forgetting rate, and forward transfer.
Figure 5: Comparison of SAPT and baselines based on different architectures of LLM backbones, including T5 (encoder-decoder) and LLaMA-2 (decoder-only).
...and 7 more figures
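
The "pulled back" step in the Figure 2 caption can be read as a penalty on the gap between the stored attention weights of a past task and the weights the current model produces for that task's pseudo samples. The sketch below assumes a KL-divergence penalty and assumes that stored and current weight vectors have already been brought to the same length (for example by zero-padding the stored ones); neither detail is stated in the text above, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def reflection_loss(stored, current):
    """Average KL(stored || current) over previous tasks.

    stored:  original attention weights a_i, one list per past task
    current: weights a_hat_i produced now for that task's pseudo samples
    Assumption: KL divergence as the distance; the caption only says the
    current weights are pulled back to their original states.
    """
    pairs = list(zip(stored, current))
    return sum(kl_divergence(a, a_hat) for a, a_hat in pairs) / len(pairs)

# Hypothetical toy values at time step 3 with two previous tasks.
stored = [[1.0, 0.0, 0.0], [0.2, 0.8, 0.0]]
drifted = [[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]]

loss_if_unchanged = reflection_loss(stored, stored)
loss_if_drifted = reflection_loss(stored, drifted)
```

Minimizing such a term alongside the new task's loss would discourage attention on old tasks from shifting toward the newly added PET block.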
