Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols

Xu Luo; Ji Zhang; Lianli Gao; Heng Tao Shen; Jingkuan Song

TL;DR

This work establishes FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and proposes the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes, providing a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.

Abstract

Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. However, the field lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic representations without suffering from overfitting. Additionally, we quantify the performance collapse of multimodal models in specialized domains as a result of linguistic rarity using adjusted Zipf frequency scores. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at https://github.com/Frankluox/FewTrans.

Paper Structure (36 sections, 1 equation, 6 figures, 9 tables)

Introduction
Related Work
The Problem of Few-shot Transfer Learning
Inappropriate Evaluation of Previous Methods
Large Performance Fluctuation Caused by Sampling
Unrealistic Model Selection
Optimal hyperparameters change from task to task
Few-shot transfer performance is sensitive to the choice of hyperparameters
Optimal hyperparameters change from dataset to dataset
Cross-validation fails to provide reliable estimation of hyperparameters
Other Considerations
No variation of the number of classes
No class imbalance
Datasets lack diversity, are too easy, and may have errors
Introducing the FewTrans Benchmark
...and 21 more sections

Figures (6)

Figure 1: Average accuracy and 95% confidence intervals of single-task few-shot transfer evaluation of a pretrained DINOv2-s model on EuroSAT.
Figure 2: Heatmaps showing how the few-shot transfer performance of a single 1-shot task sampled from EuroSAT changes with hyperparameters. We fix the number of epochs to $50$ in the left plot, and fix the head lr to $0.01$ in the right plot. The black rectangles highlight the optimal hyperparameter areas.
Figure 3: Cross-validation cannot find good hyperparameters when the number of shots is small, regardless of the domain shift between the pretraining and downstream datasets. For the left plot, we use a subset of ImageNet classes as the training set and the remaining classes as the downstream dataset.
Figure 4: Robustness and fairness analysis of the Hyperparameter Ensemble (HPE) protocol. Left: Positive Pearson correlation ($r=0.38$) between hyperparameter sensitivity and the HPE penalty, confirming that the protocol naturally penalizes volatile methods. Right: Stability of HPE Top-1 accuracy across various grid spacings (3x, 5x, 10x), demonstrating that the protocol effectively buffers against the specific choice of hyperparameter boundaries across all 10 datasets.
Figure 5: Analysis of the adaptation mechanism for full fine-tuning and LoRA. (Left) Layer-wise parameter update scales ($L_{2}$ norm of $\Delta W$) across 12 Transformer blocks, showing that full fine-tuning relies on distributed micro-adjustments ($0.01 \sim 0.07$) to avoid overfitting. (Right) Feature distribution shift measured by Centered Kernel Alignment (CKA) similarity at deep layers, where full fine-tuning demonstrates more effective reshaping of high-level semantic representations compared to the constrained adaptation of LoRA.
...and 1 more figure
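
The two quantities named in the Figure 5 caption can be illustrated with a minimal NumPy sketch. This is only an illustration under stated assumptions: the caption does not say which CKA variant the paper uses, so the linear form is assumed here, and the function names `linear_cka` and `update_norms` are hypothetical and not taken from the FEWTRANS code. Whether the paper normalizes or aggregates the $\Delta W$ norms per block is also not stated, so the sketch computes one plain norm per weight matrix.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Assumption: the linear variant of CKA; the paper may use another kernel.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x) ** 2
    return cross / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y))


def update_norms(weights_before, weights_after):
    """L2 (Frobenius) norm of Delta W for each pair of weight matrices.

    Hypothetical helper: one entry per layer, with no normalization.
    """
    return [
        float(np.linalg.norm(after - before))
        for before, after in zip(weights_before, weights_after)
    ]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for deep-layer features before and after adaptation.
    feats_pre = rng.normal(size=(64, 16))
    feats_post = feats_pre + 0.5 * rng.normal(size=(64, 16))
    print("CKA(pre, post):", linear_cka(feats_pre, feats_post))

    # Stand-ins for the weights of 12 Transformer blocks.
    before = [rng.normal(size=(8, 8)) for _ in range(12)]
    after = [w + 0.01 * rng.normal(size=(8, 8)) for w in before]
    print("Layer-wise ||Delta W||:", update_norms(before, after))
```

A CKA value near 1 means the adapted features are almost a rotation or rescaling of the pre-trained ones, while lower values indicate a stronger reshaping of the representation, which is the sense in which the caption compares full fine-tuning with LoRA.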
