Back to the Basics on Predicting Transfer Performance

Levy Chaves, Eduardo Valle, Alceu Bissoto, Sandra Avila

TL;DR

This paper tackles the challenge of predicting transfer performance in two ways: it benchmarks a wide set of transferability scorers within a rigorous experimental-design framework, and it introduces Back to Bayes, a three-level Bayesian hierarchical regression that fuses multiple scorers. The authors show that aggregated analysis and bootstrapped benchmarking yield more reliable estimates than per-dataset evaluations. They also demonstrate that, while the ImageNet baseline remains a strong predictor, combining diverse scorers with ImageNet generally improves transferability predictions, especially on challenging medical datasets. Key contributions include a robust benchmark design built on aggregated tau metrics and a principled method for calibrating scorers that can be reused on new target datasets. The work highlights the value of information fusion in transferability estimation and points to future work that leverages posterior uncertainty and covers broader transfer scenarios for practical deployment.

Abstract

In the evolving landscape of deep learning, selecting the best pre-trained models from a growing number of choices is a challenge. Transferability scorers propose alleviating this scenario, but their recent proliferation, ironically, poses the challenge of their own assessment. In this work, we propose both robust benchmark guidelines for transferability scorers, and a well-founded technique to combine multiple scorers, which we show consistently improves their results. We extensively evaluate 13 scorers from literature across 11 datasets, comprising generalist, fine-grained, and medical imaging datasets. We show that few scorers match the predictive performance of the simple raw metric of models on ImageNet, and that all predictors suffer on medical datasets. Our results highlight the potential of combining different information sources for reliably predicting transferability across varied domains.

Paper Structure (20 sections, 7 equations, 3 figures, 1 table)

This paper contains 20 sections, 7 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: SOTA scorers, ImageNet-based baseline, and Back to Bayes. The plots illustrate the techniques proposed in Section \ref{sec:benchmarking}: the combined analysis of all datasets and the use of bootstrapping. The combined analysis is made possible by the proposed aggregated tau (Section \ref{sec:benchmark_standard}), and bootstrapping is used to compute its 95% confidence intervals, shown in brackets on each plot. The main message of these plots lies not in the individual data points but in whether the points concentrate along the main diagonal, which shows the scorer's ability to match the ranks of transfer scores and test metrics.
  • Figure 2: Weighted tau ($\tau_\text{w}\times100$), higher is better. Row groups, from top: ImageNet baseline, state-of-the-art, Back to Bayes, ablations. Dataset groups, from left: generalist, natural, artifacts, medical. Averaged: average of individual dataset outcomes. Combined: outcome on combined dataset measurements, with aggregated weighted tau ($\mathring{\tau_\text{w}}\times100$). The ridgeline plots show the distribution over 1000 bootstrap iterations, whose mean is given numerically. Unmodified SOTA: orange; Back to Bayes: blue. (Best in color, details in the text.)
  • Figure 3: Frozen feature extractors, weighted tau ($\tau_\text{w}\times100$), higher is better. Row groups, dataset groups, and colors as in Figure \ref{fig:ridgeline_wtau}.
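
The captions above rely on bootstrapping a rank correlation between transferability scores and test metrics. The sketch below shows that general idea only, under stated assumptions: it uses plain Kendall tau (tau-a) as a stand-in for the paper's weighted and aggregated tau, whose definitions are not given in this summary, and the function names and the score/accuracy values are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(scores, metrics):
    """Plain Kendall tau-a: (concordant - discordant) / number of pairs.

    Stand-in for the paper's weighted tau; tied pairs contribute zero.
    """
    pairs = list(combinations(range(len(scores)), 2))
    total = 0
    for i, j in pairs:
        prod = (scores[i] - scores[j]) * (metrics[i] - metrics[j])
        total += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return total / len(pairs)


def bootstrap_tau_ci(scores, metrics, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample (score, metric) pairs with replacement.

    Returns the mean tau over the resamples and a (lower, upper) interval.
    """
    rng = random.Random(seed)
    n = len(scores)
    taus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        taus.append(kendall_tau([scores[i] for i in idx],
                                [metrics[i] for i in idx]))
    taus.sort()
    lo = taus[max(0, int((alpha / 2) * n_boot))]
    hi = taus[min(n_boot - 1, max(0, int((1 - alpha / 2) * n_boot) - 1))]
    return sum(taus) / n_boot, (lo, hi)


if __name__ == "__main__":
    # Hypothetical values: one transferability score and one test accuracy
    # per pre-trained model on a single target dataset.
    scores = [0.10, 0.40, 0.30, 0.80, 0.60, 0.90]
    accuracy = [61.0, 70.5, 72.0, 80.1, 78.3, 85.0]
    mean_tau, (lo, hi) = bootstrap_tau_ci(scores, accuracy)
    print(f"tau = {mean_tau:.2f} [{lo:.2f}, {hi:.2f}]")
```

Resampling whole (score, metric) pairs keeps each model's score tied to its own test result, so the spread of the resampled taus reflects how sensitive the ranking agreement is to which models happen to be in the pool.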