Collaboratively adding new knowledge to an LLM

Rhui Dih Lee, Laura Wynter

TL;DR

Overall, LoRA performs better in most cases than full fine-tuning of all parameters when both new knowledge acquisition and retention of old, including recent, knowledge are taken into account.

Abstract

We address the question of how to successively add new knowledge to an LLM whilst retaining previously-added knowledge. We consider two settings, semi-cooperative and fully-cooperative. Overall, LoRA performs better in most cases than full fine-tuning of all parameters when both new knowledge acquisition and retention of old, including recent, knowledge are taken into account. In the semi-cooperative setting, where datasets are not available after training, MoE mixing, model merging, and LoRA-based orthogonal subspace sequential learning, using a small weight on the orthogonality term, perform well. In the fully-cooperative setting, where datasets remain available, joint training and sequential training with replay are both effective approaches, with LoRA training generally preferable to full fine-tuning. The code needed to reproduce the results is provided in an open-source repository.

Paper Structure

This paper contains 10 sections, 2 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: One-round training, comparing the base model, called "base" in the figure, with performance after fine-tuning the model on 6 different dataset-tasks. New knowledge acquisition on average is similar with LoRA and FFT.
  • Figure 2: Average over the one-round trained models, each trained from the base model and tuned on one dataset-task. Average old-knowledge retention after one-round training suffers considerably more degradation with full fine-tuning (FFT) than with LoRA training. Performance of the original base model is shown in the figure as "base".
  • Figure 3: Sequential training: old-knowledge retention after 7 rounds of sequential training, each training starting from the previously-trained model. Full fine-tuning suffers considerably more degradation of old knowledge than LoRA training. "Base" refers to the scores obtained by the original llama3.1 model without further training.
  • Figure 4: Sequential training: new knowledge acquisition and retention after 7 rounds of training, each training starting from the previously-trained model. The two right-most bars are one-round LoRA and one-round FFT training, which can be viewed as upper bounds on training performance. Sequential LoRA training is mostly adequate and comparable on average to sequential FFT on new data.
  • Figure 5: Knowledge acquisition quality after training using LoRA versus orthogonal LoRA with low ($\lambda=0.5$) and high ($\lambda=5$) weights. Quality is evaluated here immediately after training on the task in question before further training, hence only acquisition is represented here, not knowledge retention.
  • ...and 5 more figures
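
To make the role of the orthogonality weight $\lambda$ in Figure 5 concrete, the following is a minimal sketch of how such a term might enter a training objective. It is an illustration under stated assumptions, not the authors' implementation: the penalty form (squared overlap between the new LoRA down-projection and those of earlier tasks) and the names `orthogonality_penalty`, `total_loss`, `a_new`, `a_prev_list`, and `lam` are all hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def orthogonality_penalty(a_new, a_prev_list):
    """Sum of squared entries of A_prev @ A_new.T over all earlier adapters.

    Assumed penalty form: it is zero when the row space of the new adapter
    is orthogonal to the row space of every earlier adapter.
    """
    return sum(float(np.sum((a_prev @ a_new.T) ** 2)) for a_prev in a_prev_list)

def total_loss(task_loss, a_new, a_prev_list, lam):
    """Task loss plus the orthogonality term weighted by lam (hypothetical objective)."""
    return task_loss + lam * orthogonality_penalty(a_new, a_prev_list)

# Toy example: rank-2 adapters over a 4-dimensional input.
a_prev = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0]])
a_orth = np.array([[0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]])      # no overlap with a_prev
a_overlap = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0]])   # shares one direction with a_prev

low = total_loss(2.0, a_overlap, [a_prev], lam=0.5)   # small weight
high = total_loss(2.0, a_overlap, [a_prev], lam=5.0)  # large weight
```

In this toy setting the overlapping adapter incurs a penalty of 1.0, so the same task loss of 2.0 becomes 2.5 with the low weight and 7.0 with the high weight, while the non-overlapping adapter is not penalized at all. A larger $\lambda$ therefore pushes training harder toward directions unused by earlier tasks, at the possible expense of fitting the new task.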