SEEKR: Selective Attention-Guided Knowledge Retention for Continual Learning of Large Language Models

Jinghan He, Haiyun Guo, Kuan Zhu, Zihan Zhao, Ming Tang, Jinqiao Wang

TL;DR

A SElective attEntion-guided Knowledge Retention method (SEEKR) is proposed. It performs attention distillation on selected attention heads for finer-grained knowledge retention, using forgettability-based and task-sensitivity-based measures to identify the most valuable heads.

Abstract

Continual learning (CL) is crucial for language models to adapt dynamically to evolving real-world demands. To mitigate catastrophic forgetting in CL, data replay has proven to be a simple and effective strategy, and subsequent data-replay-based distillation can further enhance performance. However, existing methods fail to fully exploit the knowledge embedded in models from previous tasks, so they need a relatively large number of replay samples to achieve good results. In this work, we first explore and emphasize the importance of attention weights in knowledge retention, and then propose a SElective attEntion-guided Knowledge Retention method (SEEKR) for data-efficient replay-based continual learning of large language models (LLMs). Specifically, SEEKR performs attention distillation on selected attention heads for finer-grained knowledge retention, where the proposed forgettability-based and task-sensitivity-based measures are used to identify the most valuable attention heads. Experimental results on two continual learning benchmarks for LLMs demonstrate the superiority of SEEKR over existing methods in both performance and efficiency. In particular, SEEKR achieves comparable or even better performance with only 1/10 of the replayed data used by other methods, and reduces the proportion of replayed data to 1%.
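
The idea of distilling attention only on a chosen subset of heads can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss or how the forgettability and task-sensitivity measures are computed, so the sketch assumes a KL-divergence loss between the old model's (teacher) and current model's (student) attention distributions, and a placeholder `importance` score per head. The names `select_heads`, `attention_distillation_loss`, and `budget` are hypothetical.

```python
import numpy as np

def select_heads(importance, budget):
    """Return (layer, head) pairs of the `budget` highest-scoring heads.

    `importance` is a (layers, heads) array of placeholder scores; how SEEKR
    actually scores heads is not described here, so this is an assumption.
    """
    flat = np.argsort(importance, axis=None)[::-1][:budget]
    layers, heads = np.unravel_index(flat, importance.shape)
    return list(zip(layers.tolist(), heads.tolist()))

def attention_distillation_loss(teacher_attn, student_attn, selected, eps=1e-12):
    """Mean KL(teacher || student) over attention rows of the selected heads.

    Both inputs have shape (layers, heads, query, key), and every row along
    the last axis is a probability distribution. KL is an assumed choice.
    """
    total = 0.0
    for layer, head in selected:
        p = teacher_attn[layer, head]
        q = student_attn[layer, head]
        kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
        total += kl.mean()
    return total / len(selected)

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy data: 2 layers, 4 heads, 3 query positions, 3 key positions.
rng = np.random.default_rng(0)
teacher = softmax(rng.normal(size=(2, 4, 3, 3)))
student = softmax(rng.normal(size=(2, 4, 3, 3)))
importance = rng.random((2, 4))  # placeholder head scores

selected = select_heads(importance, budget=3)
loss = attention_distillation_loss(teacher, student, selected)
```

In a replay-based setup, a term like `loss` would be added to the task loss on replayed samples; restricting it to a small `budget` of heads is what would keep the distillation cost low.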

Paper Structure

This paper contains 27 sections, 15 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Demonstration of the critical role of attention weights in knowledge retention. We apply DER++ [buzzega2020dark] for continual learning on the TRACE benchmark [wang2023trace] to obtain multiple old task models and the final model. Grafting the attention weights of the old models onto the final model at inference can maintain better performance on the old tasks. Moreover, the final model obtained by our continual learning method, SEEKR, achieves similar results.
  • Figure 2: Results of SEEKR across different distillation budgets and different replay data ratios.
  • Figure 3: Continual learning performance and changes in general ability with Vicuna-13B-v1.5.
  • Figure 4: Histogram of the cumulative variation in the attention weights of the attention heads in the model during sequential finetuning.
  • Figure 5: Visualization of the importance scores of all heads in the model.