Parameter Efficient Fine-tuning of Self-supervised ViTs without Catastrophic Forgetting

Reza Akbarian Bafghi, Nidhin Harilal, Claire Monteleoni, Maziar Raissi

TL;DR

This work tackles catastrophic forgetting in Vision Transformers (ViTs) during fine-tuning on new domains, showing that their general abilities degrade markedly after transfer: a DINO ViT/B-16 pre-trained on ImageNet-1K loses over $70\%$ accuracy on ImageNet-1K after only $10$ iterations of fine-tuning on CIFAR-100. It adapts two parameter-efficient fine-tuning (PEFT) methods from NLP, Block Expansion and LoRA, to ViTs. Block Expansion increases depth by inserting blocks initialized as identity mappings, while LoRA adds low-rank adapters so that $W' = W + AB$. Block Expansion generally preserves ImageNet-1K performance while achieving strong transfer accuracy; LoRA is effective in some domains but can degrade on simpler datasets such as CIFAR-10; standard full fine-tuning suffers severe forgetting. Overall, the study demonstrates that PEFT enables continual adaptation of ViTs with far fewer trainable parameters, reducing forgetting while preserving core knowledge.
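
The low-rank update $W' = W + AB$ can be illustrated with a minimal, pure-Python sketch. This is not the authors' implementation: the toy dimensions, the rank of 8, and the choice to zero-initialize `B` (so the adapter starts as a no-op) are assumptions made for illustration, and the helper names `matmul`, `lora_merge`, and `param_counts` are hypothetical.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_merge(w, a, b):
    """Return the adapted weight W' = W + A B; W itself is left unchanged."""
    delta = matmul(a, b)
    return [[wi + di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def param_counts(d_out, d_in, rank):
    """Trainable parameters: a full d_out x d_in matrix vs. its low-rank adapter."""
    return d_out * d_in, rank * (d_out + d_in)

rng = random.Random(0)
d_out, d_in, rank = 4, 6, 2  # toy sizes, chosen only for illustration

# Frozen pre-trained weight (random stand-in here).
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
# Trainable low-rank factors: A is d_out x rank, B is rank x d_in.
A = [[rng.gauss(0.0, 0.02) for _ in range(rank)] for _ in range(d_out)]
B = [[0.0] * d_in for _ in range(rank)]  # zero init is an assumption

W_prime = lora_merge(W, A, B)
print(W_prime == W)  # the adapter changes nothing before any training step

# 768 is the ViT-B hidden width; rank 8 is a hypothetical choice.
full, low_rank = param_counts(768, 768, 8)
print(full, low_rank)
```

Only `A` and `B` would receive gradients in this setup, which is where the reduction in trainable parameters comes from.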

Abstract

Artificial neural networks often suffer from catastrophic forgetting, where learning new concepts leads to a complete loss of previously acquired knowledge. We observe that this issue is particularly magnified in vision transformers (ViTs), where post-pre-training and fine-tuning on new tasks can significantly degrade the model's original general abilities. For instance, a DINO ViT-Base/16 pre-trained on ImageNet-1k loses over 70% accuracy on ImageNet-1k after just 10 iterations of fine-tuning on CIFAR-100. Overcoming this stability-plasticity dilemma is crucial for enabling ViTs to continuously learn and adapt to new domains while preserving their initial knowledge. In this work, we study two new parameter-efficient fine-tuning strategies: (1) Block Expansion and (2) Low-rank adaptation (LoRA). Our experiments reveal that using either Block Expansion or LoRA on self-supervised pre-trained ViTs surpasses fully fine-tuned ViTs in new domains while offering significantly greater parameter efficiency. Notably, we find that Block Expansion experiences only a minimal performance drop in the pre-training domain, thereby effectively mitigating catastrophic forgetting in pre-trained ViTs.

Paper Structure (11 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Comparison of different ViT fine-tuning approaches: (a) Linear-probing ViT model with all weights frozen (cyan blocks) and a trainable classifier (white block). (b) Fully fine-tuned ViT with all trainable weights (white blocks). (c) Block Expansion with additional blocks containing trainable zero-initialized linear layers (red blocks) and other trainable parameters (white blocks). (d) Low-Rank Adaptation (LoRA) weights (white blocks) added in parallel to the frozen pre-trained weights of Queries and Values (cyan blocks).
  • Figure 2: Comparison of top-1 accuracy between fine-tuned DINO ViT/B-16 models on transfer datasets and ImageNet-1K: the figure illustrates that models fine-tuned with Block Expansion achieve high accuracy on target datasets (e.g., CIFAR-10) while also preserving knowledge of the pre-trained dataset (ImageNet-1K).
  • Figure 3: Effect of learning rate on catastrophic forgetting when fine-tuning with Block Expansion: higher learning rates yield similar transfer-dataset performance but worsen source-domain forgetting, especially when more blocks are added. The green line represents the accuracy of the unchanged backbone on IN-1K.
  • Figure 4: Comparing top-1 accuracy of fine-tuned DINO ViT/S-16 models shows LoRA and Block Expansion methods outperform traditional fine-tuning on both transfer datasets and ImageNet-1K.
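
The idea behind Figure 1(c), added blocks whose zero-initialized linear layers make them start as identity mappings, can be sketched as follows. This is a simplified illustration rather than the paper's code: each block is reduced to a single residual MLP branch (a real ViT block also has an attention branch), the placement of one new block after each equal-sized group of frozen blocks is an assumption, and all function names are hypothetical.

```python
import random

def linear(x, weight, bias):
    """Apply y = W x + b to a vector stored as a Python list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(v):
    return [max(0.0, t) for t in v]

def residual_block(x, params):
    """Residual MLP block: x + (W2 relu(W1 x + b1) + b2)."""
    w1, b1, w2, b2 = params
    update = linear(relu(linear(x, w1, b1)), w2, b2)
    return [xi + ui for xi, ui in zip(x, update)]

def random_block(dim, hidden, rng):
    """Stand-in for a pre-trained block (all weights non-zero)."""
    w1 = [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(dim)]
    return (w1, [0.0] * hidden, w2, [0.0] * dim)

def new_identity_block(dim, hidden, rng):
    """New block whose output layer is zero, so it initially returns x."""
    w1 = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    w2 = [[0.0] * hidden for _ in range(dim)]  # zero-initialized linear layer
    return (w1, [0.0] * hidden, w2, [0.0] * dim)

def expand(blocks, num_new, dim, hidden, rng):
    """Insert one identity block after each group of frozen blocks.

    Returns the expanded block list and a parallel list of flags marking
    which blocks would be trainable (only the newly added ones).
    Assumes len(blocks) is divisible by num_new.
    """
    group = len(blocks) // num_new
    expanded, trainable = [], []
    for i, block in enumerate(blocks, start=1):
        expanded.append(block)
        trainable.append(False)  # original blocks stay frozen
        if i % group == 0:
            expanded.append(new_identity_block(dim, hidden, rng))
            trainable.append(True)
    return expanded, trainable

def forward(x, blocks):
    for params in blocks:
        x = residual_block(x, params)
    return x

rng = random.Random(0)
dim, hidden = 4, 8
backbone = [random_block(dim, hidden, rng) for _ in range(4)]
expanded, trainable = expand(backbone, 2, dim, hidden, rng)

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
print(len(expanded), trainable)
print(forward(x, expanded) == forward(x, backbone))  # same output at init
```

Because the added blocks leave the output unchanged at initialization, fine-tuning starts from exactly the pre-trained function, and only the new blocks move away from it.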