Weight Copy and Low-Rank Adaptation for Few-Shot Distillation of Vision Transformers

Diana-Nicoleta Grigore; Mariana-Iuliana Georgescu; Jon Alvarez Justo; Tor Johansen; Andreea Iuliana Ionescu; Radu Tudor Ionescu

TL;DR

This paper proposes a few-shot feature distillation approach for vision transformers. Weights from intermittent teacher layers are first copied into a shallower student, and an enhanced version of Low-Rank Adaptation (LoRA) is then trained via distillation to recover the information processing carried out by the skipped teacher layers.

Abstract

Few-shot knowledge distillation recently emerged as a viable approach to harness the knowledge of large-scale pre-trained models, using limited data and computational resources. In this paper, we propose a novel few-shot feature distillation approach for vision transformers. Our approach is based on two key steps. Leveraging the fact that vision transformers have a consistent depth-wise structure, we first copy the weights from intermittent layers of existing pre-trained vision transformers (teachers) into shallower architectures (students), where the intermittence factor controls the complexity of the student transformer with respect to its teacher. Next, we employ an enhanced version of Low-Rank Adaptation (LoRA) to distill knowledge into the student in a few-shot scenario, aiming to recover the information processing carried out by the skipped teacher layers. We present comprehensive experiments with supervised and self-supervised transformers as teachers, on six data sets from various domains (natural, medical and satellite images) and tasks (classification and segmentation). The empirical results confirm the superiority of our approach over state-of-the-art competitors. Moreover, the ablation results demonstrate the usefulness of each component of the proposed pipeline. We release our code at https://github.com/dianagrigore/WeCoLoRA.
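The two steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the released WeCoLoRA code. The function names, the choice to start copying at layer index 0, and the initialization scale are assumptions made for illustration, and the low-rank update shown is standard LoRA rather than the enhanced variant proposed in the paper.

```python
import random


def copy_intermittent_layers(teacher_layers, r):
    """Keep every r-th teacher layer to form the student.

    Starting at index 0 is an assumption; the abstract only states that
    weights are copied from intermittent layers with factor r.
    """
    if r < 1:
        raise ValueError("r must be a positive integer")
    return [teacher_layers[i] for i in range(0, len(teacher_layers), r)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, rank, seed=0):
    """Standard LoRA initialization: B is zero, A is small random noise."""
    rng = random.Random(seed)
    b = [[0.0] * rank for _ in range(d_out)]
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    return b, a


def lora_weight(w, b, a, scale=1.0):
    """Effective weight W + scale * (B @ A); W stays frozen, only A and B train."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# A 12-block teacher (as in ViT-B) compressed with r = 2 gives a 6-block student.
teacher = [f"block_{i}" for i in range(12)]
student = copy_intermittent_layers(teacher, r=2)

# Because B starts at zero, the student initially behaves exactly like the copied weights.
w = [[1.0, 2.0], [3.0, 4.0]]
b, a = init_lora(d_out=2, d_in=2, rank=1)
w_eff = lora_weight(w, b, a)
```

Zero-initializing `B` means distillation starts from the copied teacher weights, and the low-rank factors only gradually shift the student toward compensating for the skipped layers.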

Paper Structure (12 sections, 5 equations, 9 figures, 11 tables)

Introduction
Related Work
Method
WeCoLoRA Architecture
Knowledge Distillation with WeCoLoRA
Experiments
Data Sets
Implementation Details
Results
Conclusion
Additional Results
Limitations

Figures (9)

Figure 1: Our unsupervised feature distillation framework is based on two steps. In the first step, weights from intermittent layers of the teacher transformer are copied directly into the student, where the intermittence factor $r$ coincides with the compression ratio between the teacher and the student transformers. In the second step, enhanced LoRA blocks are integrated into the student network. The enhanced LoRA blocks are trained via feature distillation on unlabeled images. In the illustrated example, the compression ratio is $r=2$. Best viewed in color.
Figure 2: Visualizations of the latent spaces learned by WeCo+KD (on the left-hand side) and our method (on the right-hand side). Both visualizations are obtained with t-SNE. The embeddings correspond to images from the RESISC45 [Cheng-PIEEE-2017] test set, before linear probing. The features are extracted from student models that are distilled from the supervised teacher ViT-B [He-CVPR-2022], using a compression factor of $r=2$. During the distillation procedure, only $1\%$ of the ImageNet data set [Deng-CVPR-2009] is used. The colors correspond to the class labels from RESISC45. Best viewed in color.
Figure 3: Attention visualizations obtained with Attention Rollout [abnar2020quantifying] for WeCo+KD and WeCoLoRA, on three test images from ImageNet-1K. The compared students are distilled from the supervised ViT-B teacher [Dosovitskiy-ICLR-2021], with a compression factor of $r=2$, on $1\%$ of the original training data. The first column displays the original images, the second column displays the attention of the WeCo+KD-based student, and the third column shows the attention of the WeCoLoRA-based student. Best viewed in color.
Figure 4: Accuracy rates of WeCoLoRA on the CIFAR-100 data set [Krizhevsky-TR-2009] when varying the matrix rank. The teacher is the supervised ViT-B [Dosovitskiy-ICLR-2021], the compression factor is $r=2$, and the distillation process uses $10\%$ of the ImageNet-1K training set [Deng-CVPR-2009].
Figure 5: Accuracy rates obtained by WeCoLoRA and WeCo+KD on the ImageNet-1K [Deng-CVPR-2009] downstream task. Results are reported for the supervised ViT-B [Dosovitskiy-ICLR-2021] teacher. The horizontal axis corresponds to the percentage of the original training set [Deng-CVPR-2009] used during knowledge distillation. Best viewed in color.
...and 4 more figures
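
The matrix rank varied in Figure 4 controls how many parameters each low-rank block adds. The sketch below counts trainable parameters for one standard LoRA pair (a `d_out x rank` matrix and a `rank x d_in` matrix) next to the frozen dense weight it adapts. The function names and the rank of 8 are illustrative assumptions, not values reported in the paper; 768 is the hidden size of ViT-B.

```python
def dense_param_count(d_out, d_in):
    """Parameters in a full d_out x d_in weight matrix (bias ignored)."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


hidden = 768   # ViT-B hidden size
rank = 8       # illustrative rank, not taken from the paper

full = dense_param_count(hidden, hidden)
low_rank = lora_param_count(hidden, hidden, rank)
fraction = low_rank / full
```

For a square projection, the trainable fraction is `2 * rank / hidden`, so it grows linearly with the rank.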
