Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols

Xu Luo, Ji Zhang, Lianli Gao, Heng Tao Shen, Jingkuan Song

TL;DR

FEWTRANS is a comprehensive benchmark of 10 diverse datasets, introduced together with the Hyperparameter Ensemble (HPE) protocol, which is proposed to overcome the "validation set illusion" in data-scarce regimes. The two are intended to provide a rigorous "ruler" that streamlines reproducible advances in few-shot transfer learning research.

Abstract

Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. However, the field lacks a unified, rigorous evaluation protocol that is both challenging and realistic for real-world usage. In this work, we establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets, and propose the Hyperparameter Ensemble (HPE) protocol to overcome the "validation set illusion" in data-scarce regimes. Our empirical findings demonstrate that the choice of pre-trained model is the dominant factor for performance, while many sophisticated transfer methods offer negligible practical advantages over a simple full-parameter fine-tuning baseline. To explain this surprising effectiveness, we provide an in-depth mechanistic analysis showing that full fine-tuning succeeds via distributed micro-adjustments and more flexible reshaping of high-level semantic representations without suffering from overfitting. Additionally, using adjusted Zipf frequency scores, we quantify the performance collapse of multimodal models in specialized domains and attribute it to linguistic rarity. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research. We make the FEWTRANS benchmark publicly available at https://github.com/Frankluox/FewTrans.

Paper Structure (36 sections, 1 equation, 6 figures, 9 tables)

This paper contains 36 sections, 1 equation, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Average accuracy and 95% confidence intervals of single-task few-shot transfer evaluation of a pretrained DINOv2-s model on EuroSAT.
  • Figure 2: Heatmaps showing how the few-shot transfer performance of a single 1-shot task sampled from EuroSAT changes with hyperparameters. We fix the number of epochs to $50$ in the left plot, and fix the head lr to $0.01$ in the right plot. The black rectangles highlight the optimal hyperparameter areas.
  • Figure 3: Cross-validation cannot find good hyperparameters when the number of shots is small, regardless of the domain shift between the pretraining and downstream datasets. For the left plot, we use a subset of ImageNet classes as the training set and the remaining classes as the downstream dataset.
  • Figure 4: Robustness and fairness analysis of the Hyperparameter Ensemble (HPE) protocol. Left: Positive Pearson correlation ($r=0.38$) between hyperparameter sensitivity and the HPE penalty, confirming that the protocol naturally penalizes volatile methods. Right: Stability of HPE Top-1 accuracy across various grid spacings (3x, 5x, 10x), demonstrating that the protocol effectively buffers against the specific choice of hyperparameter boundaries across all 10 datasets.
  • Figure 5: Analysis of the adaptation mechanism for full fine-tuning and LoRA. (Left) Layer-wise parameter update scales ($L_{2}$ norm of $\Delta W$) across 12 Transformer blocks, showing that full fine-tuning relies on distributed micro-adjustments ($0.01 \sim 0.07$) to avoid overfitting. (Right) Feature distribution shift measured by Centered Kernel Alignment (CKA) similarity at deep layers, where full fine-tuning demonstrates more effective reshaping of high-level semantic representations compared to the constrained adaptation of LoRA.
  • ...and 1 more figures
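
The two quantities named in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It rests on assumptions: the captions do not say which CKA variant is used, so the linear variant is assumed here, and the function names `linear_cka` and `update_norms`, along with the dictionary-of-arrays weight layout, are hypothetical and not taken from the FEWTRANS code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (n_samples, dim).

    Assumption: the linear variant of CKA; the source does not specify one.
    """
    # Center each feature dimension over the samples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x) ** 2
    norm_x = np.linalg.norm(x.T @ x)
    norm_y = np.linalg.norm(y.T @ y)
    return float(cross / (norm_x * norm_y))


def update_norms(before: dict, after: dict) -> dict:
    """L2 (Frobenius) norm of Delta W = W_after - W_before for each named weight."""
    return {
        name: float(np.linalg.norm(after[name] - before[name]))
        for name in before
    }


# Usage with synthetic data (hypothetical shapes, not values from the paper).
rng = np.random.default_rng(0)
feats_pretrained = rng.normal(size=(64, 16))
feats_finetuned = feats_pretrained + 0.5 * rng.normal(size=(64, 16))
print("CKA(pretrained, fine-tuned):", linear_cka(feats_pretrained, feats_finetuned))

w_before = {f"block{i}": rng.normal(size=(8, 8)) for i in range(3)}
w_after = {k: v + 0.01 * rng.normal(size=v.shape) for k, v in w_before.items()}
print("Per-block update norms:", update_norms(w_before, w_after))
```

A CKA value near 1 between pre-trained and adapted features would indicate little change in the representation at that layer, while lower values indicate stronger reshaping. The per-block norms give the kind of layer-wise update profile the left panel describes.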