Lessons and Insights from a Unifying Study of Parameter-Efficient Fine-Tuning (PEFT) in Visual Recognition

Zheda Mai, Ping Zhang, Cheng-Hao Tu, Hong-You Chen, Li Zhang, Wei-Lun Chao

TL;DR

The paper addresses the lack of a systematic understanding of parameter-efficient fine-tuning (PEFT) in visual recognition through a unifying empirical study of representative PEFT methods on Vision Transformers (ViT). It tunes hyperparameters fairly across the low-shot VTAB-1K benchmark and many-shot settings, and adds robustness tests under distribution shift. The results show that PEFT can match or exceed full fine-tuning in many scenarios, and that different PEFT methods make diverse predictions that carry complementary information. Key contributions include a reproducible evaluation framework, practical usage guidelines, and insights into ensemble opportunities and into how weight-space ensembles (WiSE) affect robustness. The findings guide practitioners on when and how to apply PEFT and point to research directions such as exploiting prediction diversity and developing PEFT strategies that stay robust under distribution shift.

Abstract

Parameter-efficient fine-tuning (PEFT) has attracted significant attention due to the growth of pre-trained model sizes and the need to fine-tune (FT) them for superior downstream performance. Despite a surge in new PEFT methods, a systematic study to understand their performance and suitable application scenarios is lacking, leaving questions like "when to apply PEFT" and "which method to use" largely unanswered, especially in visual recognition. In this paper, we conduct a unifying empirical study of representative PEFT methods with Vision Transformers. We systematically tune their hyperparameters to fairly compare their accuracy on downstream tasks. Our study offers a practical user guide and unveils several new insights. First, if tuned carefully, different PEFT methods achieve similar accuracy in the low-shot benchmark VTAB-1K. This includes simple approaches, such as fine-tuning only the bias terms, that were previously reported as inferior. Second, despite similar accuracy, we find that PEFT methods make different mistakes and different high-confidence predictions, likely due to their different inductive biases. Such an inconsistency (or complementarity) opens up the opportunity for ensemble methods, and we make preliminary attempts at this. Third, going beyond the commonly used low-shot tasks, we find that PEFT is also useful in many-shot regimes, achieving comparable or better accuracy than full FT while using significantly fewer parameters. Lastly, we investigate PEFT's ability to preserve the robustness of a pre-trained model (e.g., CLIP) to distribution shifts. Perhaps not surprisingly, PEFT approaches outperform full FT alone. However, with weight-space ensembles, full FT can better balance target distribution and distribution shift performance, suggesting a future research direction for robust PEFT.
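
The weight-space ensemble mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes the WiSE formulation of Wortsman et al. (2022), in which the pre-trained weights and the fine-tuned weights are linearly interpolated, $\theta_\alpha = (1-\alpha)\,\theta_{\text{pre}} + \alpha\,\theta_{\text{ft}}$. The function name `wise_interpolate`, the argument `alpha`, and the representation of weights as a dictionary of flat float lists are illustrative assumptions, not the authors' implementation.

```python
def wise_interpolate(pretrained, finetuned, alpha):
    """Linearly interpolate two weight dictionaries (hypothetical helper).

    pretrained, finetuned: dict mapping parameter name -> flat list of floats.
    alpha: 0.0 returns the pre-trained weights, 1.0 the fine-tuned weights.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")

    merged = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        if len(w_pre) != len(w_ft):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise convex combination of the two weight vectors.
        merged[name] = [(1.0 - alpha) * p + alpha * f for p, f in zip(w_pre, w_ft)]
    return merged


# Toy example with a single two-element parameter (made-up numbers).
theta_pre = {"head.weight": [0.0, 2.0]}
theta_ft = {"head.weight": [1.0, 4.0]}
theta_mix = wise_interpolate(theta_pre, theta_ft, alpha=0.5)
```

Sweeping `alpha` from 0 to 1 traces a curve between the pre-trained and fine-tuned models, which is how the trade-off between target-distribution accuracy and distribution-shift accuracy is typically examined.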

Paper Structure (56 sections, 17 equations, 16 figures, 7 tables)

This paper contains 56 sections, 17 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Highlights of our insights. (a) Downstream accuracy: with proper implementation and fair tuning, different PEFT methods achieve similar accuracy ($\bullet$-$\bullet$: the range from the most to the least accurate methods) and consistently outperform linear probing ($\times$) and full FT ($\blacksquare$) on VTAB-1K. (b) Diverse predictions: despite reaching similar downstream performance, different PEFT methods produce diverse predictions. This opens new opportunities for ensemble approaches and other learning paradigms (e.g., semi-supervised learning) that can exploit the prediction discrepancies. (c) Distribution shift accuracy: fine-tuning a CLIP ViT-B/16, known for its generalizability across domains, with PEFT on ImageNet-1K (100 samples/class) better preserves the distribution shift accuracy (Y-axis, averaged across ImageNet-(V2, S, R, A)) than full FT, as evidenced by the $\star$ points. Interestingly, weight-space ensembles (WiSE; Wortsman et al., 2022) are applicable between PEFT's FT model and the pre-trained model ($\blacksquare$), but are not as effective as when applied to the fully FT model. Details are in \ref{['sec:few']}, \ref{['sec:complememtary']} and \ref{['sec: robust']}.
  • Figure 2: Ranking frequency of 15 methods (14 PEFT + linear probing) for three groups in VTAB-1K. Element $(i, j)$ is the number of times method $i$ ranks $j$-th in each group. Methods are ordered by mean rank (in brackets). The parameters column shows the number of trainable parameters in millions. More details are in \ref{['-sec:results']}.
  • Figure 3: (a) Prediction similarity analysis: element $(i,j)$ shows the percentage of samples on which methods $i$ and $j$ make the same prediction. Although different methods achieve similar accuracy, they have diverse predictions. (b) The wrong prediction overlaps of LoRA, Adapter, and SSF for the 5K least confident data. Correct prediction overlaps for the 5K most confident data are shown in \ref{['fig:highconf']}. All models are fine-tuned on CIFAR100 (VTAB-1K). More results for (a) and (b) are in \ref{['-sec:results']}.
  • Figure 4: Ensemble (majority vote) shows consistent gain on most datasets thanks to the diverse predictions.
  • Figure 5: PEFT accuracy in many-shot regimes, with different parameter sizes (X-axis) on three datasets from different domains. Even 2%-5% trainable parameters give the models sufficient capacity to learn from full data. Details are in \ref{['-sec:results']}.
  • ...and 11 more figures
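
The prediction similarity analysis of Figure 3(a) and the majority-vote ensemble of Figure 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names `agreement_matrix` and `majority_vote`, the toy labels, and the tie-breaking rule (ties go to the method listed first) are all hypothetical choices made for the example.

```python
from collections import Counter


def agreement_matrix(predictions):
    """Percentage of samples on which each pair of methods predicts the same label.

    predictions: dict mapping method name -> list of predicted labels,
    all lists covering the same samples in the same order.
    """
    names = list(predictions)
    n = len(predictions[names[0]])
    if n == 0 or any(len(predictions[m]) != n for m in names):
        raise ValueError("all methods need the same non-zero number of predictions")

    matrix = {}
    for a in names:
        for b in names:
            same = sum(x == y for x, y in zip(predictions[a], predictions[b]))
            matrix[(a, b)] = 100.0 * same / n
    return matrix


def majority_vote(predictions):
    """Per-sample majority vote across methods.

    Assumed tie-break: among the most frequent labels, pick the one
    predicted by the method that appears first in the dictionary.
    """
    voted = []
    for labels in zip(*predictions.values()):
        counts = Counter(labels)
        top = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == top))
    return voted


# Toy predictions for four samples (made-up labels, not results from the paper).
preds = {
    "LoRA": [0, 1, 2, 3],
    "Adapter": [0, 1, 1, 3],
    "SSF": [0, 2, 1, 0],
}
sim = agreement_matrix(preds)
print(sim[("LoRA", "Adapter")])  # 75.0
print(majority_vote(preds))      # [0, 1, 1, 3]
```

An ensemble can only improve on its members when the methods disagree on some samples, so a low off-diagonal value in the agreement matrix is what makes the vote worthwhile.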