An Empirical Analysis of Forgetting in Pre-trained Models with Incremental Low-Rank Updates

Albin Soutif-Cormerais, Simone Magistri, Joost van de Weijer, Andrew D. Bagdanov

TL;DR

This work analyzes forgetting in pretrained vision models that are updated incrementally with per-task LoRA adapters, each merged back into the base weights after its task. It compares Vision Transformers (ViT) and ResNet-50 across four fine-grained datasets (Cars, Flowers, Aircraft, Birds) and several LoRA ranks, under short and long task sequences. The key findings are that a lower LoRA rank reduces forgetting of both the pretraining task and the downstream tasks, and that ViTs exhibit contextual forgetting, in which pretraining classes semantically related to the downstream domain are forgotten, a phenomenon not seen in ResNets. Forward transfer can occur in longer sequences for ViT, and Learning without Forgetting (LwF) further aids stability. The results highlight the importance of rank choice and architecture when applying incremental low-rank updates to pretrained models, and suggest adaptive replay and rank-aware strategies as directions for preserving upstream knowledge while improving downstream performance.

Abstract

Broad, open-source availability of large pretrained foundation models on the internet through platforms such as HuggingFace has taken the world of practical deep learning by storm. A classical pipeline for neural network training now typically consists of finetuning a pretrained network on a small target dataset instead of training from scratch. In the case of large models this can be done even on modest hardware using a low-rank training technique known as Low-Rank Adaptation (LoRA). While low-rank training has already been studied in the continual learning setting, existing works often consider storing the learned adapter along with the existing model but rarely attempt to modify the weights of the pretrained model by merging the LoRA adapter with the existing weights after finishing the training of each task. In this article we investigate this setting and study the impact of LoRA rank on the forgetting of the pretraining foundation task and on the plasticity and forgetting of subsequent ones. We observe that this rank has an important impact on forgetting of both the pretraining and downstream tasks. We also observe that vision transformers finetuned in this way exhibit a sort of "contextual" forgetting, a behaviour that we do not observe for residual networks and that we believe has not yet been observed in previous continual learning works.
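
The merge step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the `alpha / rank` scaling is the usual LoRA convention and is assumed here, the function names `merge_lora` and `run_sequence` are hypothetical, and random matrices stand in for adapters that would normally be trained on each task.

```python
import numpy as np


def merge_lora(weight, lora_a, lora_b, alpha):
    """Fold one low-rank adapter into a dense weight matrix.

    weight: (d_out, d_in), lora_a: (rank, d_in), lora_b: (d_out, rank).
    The alpha / rank scaling is an assumed convention, not taken from the paper.
    """
    rank = lora_a.shape[0]
    return weight + (alpha / rank) * (lora_b @ lora_a)


def run_sequence(num_tasks=3, rank=2, d_out=16, d_in=12, seed=0):
    """Merge one fresh adapter per task into the same weight matrix."""
    rng = np.random.default_rng(seed)
    w0 = rng.normal(size=(d_out, d_in))  # stand-in for a pretrained weight
    w = w0.copy()
    for _ in range(num_tasks):
        # Random values are placeholders for adapters trained on each task.
        lora_a = rng.normal(size=(rank, d_in))
        lora_b = rng.normal(size=(d_out, rank))
        w = merge_lora(w, lora_a, lora_b, alpha=float(rank))
    return w0, w


if __name__ == "__main__":
    w0, w = run_sequence(num_tasks=3, rank=2)
    # The accumulated change is a sum of rank-2 updates, so its rank is at most 6.
    print(np.linalg.matrix_rank(w - w0, tol=1e-8))
```

After merging, the adapter no longer exists as a separate object: the pretrained weights themselves have changed, which is why forgetting of the pretraining task becomes a concern in this setting. Each task contributes an update of rank at most `rank`, so the rank bounds how far a single task can move each weight matrix.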

Paper Structure (12 sections, 1 equation, 15 figures, 4 tables)

This paper contains 12 sections, 1 equation, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Test accuracy on the pretraining task Imagenet1k (Left) and on the downstream task Cars (Right) when fine-tuning in the Short Setting with multiple LoRA ranks, using a ViT network (Blue) and a ResNet-50 (Green). Results are averaged over 4 runs with different random seeds.
  • Figure 2: Test accuracy on the pretraining task Imagenet1k (Left) and on the downstream task Cars (Right) when training in the Short Setting with multiple LoRA ranks in combination with the LwF method, using a ViT network (Blue) and a ResNet-50 (Green). Results are averaged over 4 runs with different random seeds.
  • Figure 3: The 20 most forgotten categories of Imagenet1k after learning on Stanford Cars (Left) and Aircraft (Right), using direct transfer from the pretrained ViT model with a LoRA adapter of rank 32.
  • Figure 4: Wu-Palmer similarity between the forgotten category names and the word "Car" after learning on the Cars dataset (Left), and the word "Plane" after learning on the Aircraft dataset (Right). The similarity distribution of all Imagenet categories is compared with that of the forgotten categories (weighted by the number of test images forgotten in each category), using direct transfer from the pretrained ViT model with a LoRA adapter of rank 32.
  • Figure 5: (Left) The 20 most forgotten classes of Imagenet1k after learning on Stanford Cars. (Right) Wu-Palmer similarity between the forgotten category names and the word "Car". The similarity distribution of all Imagenet categories is compared with that of the forgotten categories (weighted by the number of test images forgotten in each category), using direct transfer from the pretrained ResNet-50 model with a LoRA adapter of rank 1.
  • ...and 10 more figures
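
The Wu-Palmer similarity used in Figures 4 and 5 can be sketched on a toy taxonomy. This is an illustration only: the `PARENT` hierarchy, the function names, and the forgotten-image counts are hypothetical, whereas the captions refer to Imagenet category names. The sketch assumes the common convention that the root has depth 1, and defines the similarity as 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest common ancestor of the two concepts.

```python
# Hypothetical toy taxonomy, child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "sports_car": "car",
    "aircraft": "vehicle",
    "airliner": "aircraft",
    "animal": "entity",
    "dog": "animal",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity of two concepts in the toy taxonomy."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def weighted_mean_similarity(forgotten_counts, target):
    """Average similarity to target, weighted by forgotten test images per class."""
    total = sum(forgotten_counts.values())
    return sum(n * wu_palmer(c, target) for c, n in forgotten_counts.items()) / total


if __name__ == "__main__":
    forgotten = {"sports_car": 30, "dog": 2}  # made-up counts for illustration
    print(wu_palmer("sports_car", "car"), wu_palmer("dog", "car"))
    print(weighted_mean_similarity(forgotten, "car"))
```

Weighting by the number of forgotten test images means that classes losing many images dominate the distribution, so a shift toward high similarity indicates that forgetting concentrates on categories close to the downstream domain.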