Learning Attentional Mixture of LoRAs for Language Model Continual Learning

Jialin Liu; Jianhua Wu; Jie Liu; Yutai Duan

TL;DR

This paper proposes Attentional Mixture of LoRAs (AM-LoRA), a continual learning approach tailored for LLMs. It learns a sequence of LoRAs, one per task, combines them through an attention mechanism, and adds a sparsity constraint during training so that the attention vector concentrates on a few relevant LoRAs.

Abstract

Fine-tuning large language models (LLMs) with Low-Rank Adaptation (LoRA) is widely acknowledged as an effective approach for continually learning new tasks. However, it often suffers from catastrophic forgetting when multiple tasks are learned sequentially. To this end, we propose Attentional Mixture of LoRAs (AM-LoRA), a continual learning approach tailored for LLMs. Specifically, AM-LoRA learns a sequence of LoRAs, one for each task in a series, to continually acquire knowledge from different tasks. The key to our approach is an attention mechanism that serves as a knowledge mixture module and adaptively integrates information from each LoRA. With the attention mechanism, AM-LoRA can efficiently leverage the distinctive contributions of each LoRA while mitigating the risk of mutually negative interactions among them that may lead to catastrophic forgetting. Moreover, we introduce an $L_1$ norm into the learning process to make the attention vector sparser. This sparsity constraint encourages the model to select a few highly relevant LoRAs rather than aggregating and weighting all LoRAs collectively, which further reduces the impact of mutual interference. Experimental results on continual learning benchmarks indicate the superiority of our proposed method.
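
The mixture step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring function, so the per-LoRA scoring vectors `score_w`, the softmax normalization, the function names, and all shapes are hypothetical choices made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lora_delta(x, A, B):
    """Low-rank update for one task: x @ A @ B, with A (d_in, r) and B (r, d_out)."""
    return x @ A @ B

def attentional_mixture(x, W0, loras, score_w):
    """Frozen base output plus an attention-weighted sum of per-task LoRA outputs.

    x:       input vector, shape (d_in,)
    W0:      frozen pre-trained weight, shape (d_in, d_out)
    loras:   list of (A, B) pairs, one per task
    score_w: list of scoring vectors, shape (d_out,), one per task
             (assumed stand-in for the paper's attention module)
    """
    deltas = np.stack([lora_delta(x, A, B) for A, B in loras])   # (T, d_out)
    scores = np.array([d @ w for d, w in zip(deltas, score_w)])  # (T,)
    attn = softmax(scores)                                       # one weight per LoRA
    mixed = (attn[:, None] * deltas).sum(axis=0)                 # (d_out,)
    return x @ W0 + mixed, attn

# Toy usage with random parameters for three tasks.
rng = np.random.default_rng(0)
d_in, d_out, r, T = 8, 6, 2, 3
x = rng.normal(size=d_in)
W0 = rng.normal(size=(d_in, d_out))
loras = [(rng.normal(size=(d_in, r)), rng.normal(size=(r, d_out))) for _ in range(T)]
score_w = [rng.normal(size=d_out) for _ in range(T)]
y, attn = attentional_mixture(x, W0, loras, score_w)
```

Here `attn` holds one non-negative weight per LoRA, so a LoRA whose weight is near zero contributes almost nothing to the output. That is the behavior the sparsity constraint is meant to encourage.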

Paper Structure (31 sections, 19 equations, 5 figures, 5 tables)

Introduction
Related Work
Parameter Efficient Fine-Tuning
Continual Learning
Method
Preliminary
Framework
Incremental Learning of Task-specific LoRAs
Attentional Selector
Loss Function with Sparsity Constraint
AM-LoRA vs. O-LoRA
Experiments
Datasets
Comparison Methods
Experimental Settings
...and 16 more sections

Figures (5)

Figure 1: Intuitive illustration of the problem of decentralized (widely dispersed) optimal solutions. (a) shows the distance between the regions of possible optimal solutions for two tasks in a standard pre-trained language model; a common optimal solution exists where the two regions intersect. (b) In an LLM, the parameter space may be so large that the optimal-solution regions of the two tasks lie too far apart to share a common optimal solution.
Figure 2: An overview of AM-LoRA. During training on the current task, the weights of the pre-trained model and the LoRA parameters of previous tasks are frozen, and only the Attentional Selector and the new task's LoRA are trained. In this process, the Task-specific LoRA Matrix Sequences are mainly responsible for learning new task knowledge, while the Attentional Selector learns how to adaptively integrate information from each LoRA when a new task is added, making full use of knowledge from previous tasks for efficient learning.
Figure 3: Comparison of LLaMA2-7B models with and without AM-LoRA on the Standard CL benchmarks. We report the change in metrics separately for each task as the number of training tasks increases. Because the fourth task is the last one, no trend can be observed for it, so we show only the first three tasks in each order.
Figure 4: Attention score distribution of each LoRA in query matrix and value matrix (LLaMA2-7B). (a) is the weight distribution diagram of AM-LoRA with query matrix bypass. It can be observed that it mainly uses the knowledge in LoRA1, LoRA3, and LoRA4. The bypass of the value matrix (b) is more inclined to utilize the knowledge in the LoRA of this task (LoRA4).
Figure 5: Comparison of the effectiveness of different AM-LoRA schemes on the Standard CL benchmarks. Among them, NR means that only the new task's dense layer participates in the training, AR means that the dense layers of all tasks participate in the training, and L1 means that sparse constraints are added to dense layers.
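
The NR, AR, and L1 variants named in the Figure 5 caption can be sketched as a choice of which per-task dense layers are updated, plus an optional penalty term. This is only an illustration under assumptions: the helper names, the penalty weight `lam`, and the choice to penalize only the trainable dense layers are hypothetical and are not taken from the paper.

```python
import numpy as np

def trainable_dense_layers(num_tasks, scheme):
    """0-based indices of the per-task dense layers updated while training the newest task."""
    if scheme == "NR":   # only the new task's dense layer is trained
        return [num_tasks - 1]
    if scheme == "AR":   # dense layers of all tasks are trained
        return list(range(num_tasks))
    raise ValueError(f"unknown scheme: {scheme}")

def l1_penalty(dense_weights, lam):
    """lam times the sum of absolute values over the given dense-layer weights."""
    return lam * sum(np.abs(w).sum() for w in dense_weights)

def total_loss(task_loss, dense_weights, trainable, lam, use_l1):
    """Task loss, optionally plus an L1 term on the trainable dense layers (an assumption)."""
    if not use_l1:
        return task_loss
    return task_loss + l1_penalty([dense_weights[i] for i in trainable], lam)

# Toy dense-layer weights for three tasks.
dense_weights = [np.array([1.0, -2.0]), np.array([0.5, 0.5]), np.array([-1.0, 3.0])]
nr = trainable_dense_layers(3, "NR")
ar = trainable_dense_layers(3, "AR")
loss_nr_l1 = total_loss(1.0, dense_weights, nr, lam=0.1, use_l1=True)
loss_ar_l1 = total_loss(1.0, dense_weights, ar, lam=0.1, use_l1=True)
loss_ar = total_loss(1.0, dense_weights, ar, lam=0.1, use_l1=False)
```

In an actual training loop, the index list would determine which dense-layer parameters receive gradients, and the L1 term would push small weights in those layers toward zero.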
