Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

Zheda Mai; Ping Zhang; Cheng-Hao Tu; Hong-You Chen; Li Zhang; Wei-Lun Chao

TL;DR

The paper addresses the lack of a systematic understanding of parameter-efficient fine-tuning (PEFT) in visual recognition through a unifying empirical study of representative ViT-based PEFT methods. With fair hyperparameter tuning on the low-shot VTAB-1K benchmark and in many-shot settings, plus robustness tests under distribution shifts, it shows that PEFT can match or exceed full fine-tuning in many scenarios and that different methods make diverse, complementary predictions. Key contributions include a reproducible evaluation framework, practical usage guidelines, and insights into ensemble opportunities and into how weight-space ensembles (WiSE) affect robustness. The findings guide practitioners on when and how to apply PEFT and point to research directions such as exploiting prediction diversity and developing PEFT strategies that stay robust under distribution shifts.

Abstract

Parameter-efficient fine-tuning (PEFT) has attracted significant attention due to the growth of pre-trained model sizes and the need to fine-tune (FT) them for superior downstream performance. Despite a surge in new PEFT methods, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like "when to apply PEFT" and "which method to use" largely unanswered, especially in visual recognition. In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. We systematically tune their hyperparameters to fairly compare their accuracy on downstream tasks. Our study offers a practical user guide and unveils several new insights. First, if tuned carefully, different PEFT methods achieve similar accuracy in the low-shot benchmark VTAB-1K. This includes simple approaches, such as fine-tuning only the bias terms, that were previously reported to be inferior. Second, despite similar accuracy, we find that PEFT methods make different mistakes and different high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementarity) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PEFT is also useful in many-shot regimes, achieving comparable or better accuracy than full FT while using significantly fewer parameters. Lastly, we investigate PEFT's ability to preserve the robustness of a pre-trained model (e.g., CLIP) to distribution shifts. Perhaps not surprisingly, PEFT approaches outperform full FT alone. However, with weight-space ensembles, full FT can better balance target distribution and distribution shift performance, suggesting a future research direction for robust PEFT.
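To make the weight-space ensemble idea concrete, the following is a minimal sketch of linear interpolation between pre-trained and fine-tuned weights, which is the core operation of WiSE. The function name `wise_interpolate`, the mixing coefficient `alpha`, the dictionary-of-lists weight format, and the toy values are all assumptions made for illustration; they are not taken from the paper's code.

```python
from typing import Dict, List

# Hypothetical weight container: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def wise_interpolate(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Linearly interpolate two weight sets that share names and sizes.

    alpha = 0 returns the pre-trained weights, alpha = 1 the fine-tuned ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    merged: Weights = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # theta = (1 - alpha) * theta_pre + alpha * theta_ft, element by element
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return merged


# Toy weights invented for illustration only.
theta_pre = {"head.weight": [0.0, 2.0], "head.bias": [1.0]}
theta_ft = {"head.weight": [1.0, 4.0], "head.bias": [3.0]}
theta_mid = wise_interpolate(theta_pre, theta_ft, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 traces a curve between the pre-trained and fine-tuned models, which is how one would look for a balance between target-distribution and distribution-shift accuracy. For a PEFT model, one plausible reading is that only the tuned parameters differ from the pre-trained ones, so the interpolation changes only those entries.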

Paper Structure (56 sections, 17 equations, 16 figures, 7 tables)

Introduction
Background
Large pre-trained models
Parameter-Efficient Fine-Tuning (PEFT)
Related work and comparison
PEFT Methods in Low-Shots Regime
Different PEFT Approaches Offer Complementary Information
PEFT Methods in Many-Shot Regime
Why Do PEFT Methods Work?
How Robust are PEFT Methods to Distribution Shifts?
Conclusion
Experiment and Dataset Details
Experiment Details
VTAB-1K
Many-shot
...and 41 more sections

Figures (16)

Figure 1: Highlights of our insights. (a) Downstream accuracy: with proper implementation and fair tuning, different PEFT methods achieve similar accuracy ($\bullet$-$\bullet$: the range from the most to the least accurate methods) and consistently outperform linear probing ($\times$) and full FT ($\blacksquare$) on VTAB-1K. (b) Diverse predictions: despite reaching similar downstream performance, different PEFT methods produce diverse predictions. This opens new opportunities for ensemble approaches and other learning paradigms (e.g., semi-supervised learning) that can exploit the prediction discrepancies. (c) Distribution shift accuracy: fine-tuning a CLIP ViT-B/16, known for its generalizability across domains, with PEFT on ImageNet-1K (100 samples/class) better preserves the distribution shift accuracy (Y-axis, averaged across ImageNet-(V2, S, R, A)) than full FT, as evidenced by the $\star$ points. Interestingly, weight-space ensembles (WiSE; Wortsman et al., 2022) are applicable between PEFT's FT model and the pre-trained model ($\blacksquare$), but are not as effective as when applied to the fully FT model. Details are in the sections "PEFT Methods in Low-Shots Regime", "Different PEFT Approaches Offer Complementary Information", and "How Robust are PEFT Methods to Distribution Shifts?".
Figure 2: Ranking frequency of 15 methods (14 PEFT + linear probing) for three groups in VTAB-1K. Element $(i, j)$ is the number of times method $i$ ranks $j^{th}$ in each group. Methods are ordered by mean ranks (in brackets). The parameters column shows the number of trainable parameters in millions. More details are in \ref{-sec:results}.
Figure 3: (a) Prediction similarity analysis: element $(i,j)$ shows the percentage of samples on which methods $i$ and $j$ make the same prediction. Although different methods achieve similar accuracy, they have diverse predictions. (b) The wrong prediction overlaps of LoRA, Adapter, and SSF for the 5K least confident data. Correct prediction overlaps for the 5K most confident data are shown in \ref{fig:highconf}. All models are fine-tuned on CIFAR100 (VTAB-1K). More results for (a) and (b) are in \ref{-sec:results}.
Figure 4: Ensemble (majority vote) shows consistent gain on most datasets thanks to the diverse predictions.
Figure 5: PEFT accuracy in many-shot regimes, with different parameter sizes (X-axis) on three datasets from different domains. Even 2%-5% trainable parameters give the models sufficient capacity to learn from full data. Details are in \ref{-sec:results}.
...and 11 more figures
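
The prediction similarity in Figure 3(a) and the majority-vote ensemble in Figure 4 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `agreement` and `majority_vote`, the tie-breaking rule, and the toy predictions and labels are all assumptions invented for the example and do not reflect any reported result.

```python
from collections import Counter
from typing import Dict, List, Sequence


def agreement(pred_a: Sequence[int], pred_b: Sequence[int]) -> float:
    """Percentage of samples on which two methods predict the same label."""
    if len(pred_a) != len(pred_b) or len(pred_a) == 0:
        raise ValueError("prediction lists must be non-empty and equally long")
    same = sum(a == b for a, b in zip(pred_a, pred_b))
    return 100.0 * same / len(pred_a)


def majority_vote(predictions: Dict[str, Sequence[int]]) -> List[int]:
    """Per-sample majority vote over the methods' predicted labels.

    Assumption: ties go to the label of the method listed first, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    per_sample = zip(*predictions.values())
    return [Counter(votes).most_common(1)[0][0] for votes in per_sample]


def accuracy(pred: Sequence[int], labels: Sequence[int]) -> float:
    """Percentage of predictions that match the labels."""
    return 100.0 * sum(p == y for p, y in zip(pred, labels)) / len(labels)


# Toy data invented for illustration only (5 samples, 3 classes).
labels = [0, 1, 2, 1, 0]
preds = {
    "lora": [0, 1, 2, 0, 0],
    "adapter": [0, 1, 1, 1, 0],
    "ssf": [0, 2, 2, 1, 1],
}
pairwise = agreement(preds["lora"], preds["adapter"])
ensemble = majority_vote(preds)
```

In this toy setup the three methods err on different samples, so the vote corrects each individual mistake. That is the mechanism the captions point to: an ensemble can only help when the members' errors do not coincide.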
