How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation

Prabhant Singh, Sibylle Hess, Joaquin Vanschoren

TL;DR

This paper questions the validity of the prevailing benchmarks used to evaluate source-independent transferability estimation (SITE) metrics. It argues that the benchmark design, dominated by a few large architectures and a static leaderboard, artificially boosts metric performance. Through empirical ablations and a controlled static-ranker experiment, the authors show that simple, dataset-agnostic strategies can outperform sophisticated SITE methods and that score differences often do not map reliably to downstream accuracy gains. They diagnose three core flaws: unrealistic model spaces, static rankings, and poor fidelity between SITE scores and actual performance gaps. The authors propose concrete best practices (diverse model spaces and datasets, disclosure of code and data, and rank-dispersion-focused design) and provide a SITE benchmarking checklist to steer future work toward more realistic, practically useful evaluation.

Abstract

Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.
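The claim that a dataset-agnostic heuristic can compete with a SITE metric can be illustrated with a small sketch. It assumes a static ranker that scores each model by its mean fine-tuning accuracy on all other datasets and never looks at the target data. The paper's exact construction is not given here and may differ. The accuracy table is made-up toy data, the function names are hypothetical, and plain Kendall's tau stands in for the weighted $\tau_w$ that the paper reports.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Plain Kendall tau-a between two equal-length score lists.

    Tied pairs contribute zero. This is NOT the weighted tau_w used in the
    paper; it is a simpler stand-in for illustration.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def static_ranker_scores(acc, held_out):
    """Score each model by its mean accuracy on every dataset except held_out.

    The scores ignore the held-out (target) dataset entirely, which is what
    makes this ranker dataset-agnostic. Assumed construction, not the paper's.
    """
    others = [name for name in acc if name != held_out]
    n_models = len(acc[held_out])
    return [sum(acc[d][m] for d in others) / len(others) for m in range(n_models)]


# Toy, made-up fine-tuning accuracies: one list entry per model (4 models).
TOY_ACC = {
    "dataset_a": [0.91, 0.88, 0.80, 0.72],
    "dataset_b": [0.85, 0.83, 0.79, 0.70],
    "dataset_c": [0.95, 0.90, 0.92, 0.81],
}

if __name__ == "__main__":
    for name in TOY_ACC:
        scores = static_ranker_scores(TOY_ACC, name)
        tau = kendall_tau(scores, TOY_ACC[name])
        print(f"{name}: tau = {tau:.2f}")
```

When the same few models lead on most datasets, as in this toy table, the static ranker already correlates strongly with the ground truth. A SITE metric therefore has to beat this baseline before its score can be read as evidence of target-specific insight.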

Paper Structure

This paper contains 23 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Illustration of Source Independent Transferability Estimation (SITE): Given a set of pre-trained models (on the left), a SITE metric computes a score $T_m$ based on extracted features on a target dataset. The scores $T_m$ are used to rank the pre-trained models according to their transferability.
  • Figure 2: Performance of transferability metrics when we remove architectures of the same families. We sequentially remove the architectures denoted on the horizontal axis and report the achieved $\tau_w$. A pattern of performance decrease with every ablation is observed.
  • Figure 3: Visualization of the ranking distribution of models in the standard benchmark. The models at the top occupy the first ranks in most datasets.
  • Figure 4: Heatmap correlation of $\Delta_{Acc}$ and $\Delta_T$
  • Figure 5: Plot of GBC scores against the ground truth accuracy.
  • ...and 5 more figures
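
The relationship between $\Delta_{Acc}$ and $\Delta_T$ named in the Figure 4 caption can be sketched as follows. The sketch assumes that $\Delta_T$ is the difference in SITE scores between two models and $\Delta_{Acc}$ the difference in their fine-tuned accuracies, and that fidelity is summarised by the Pearson correlation over all model pairs. Both the definitions and the numbers are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def pairwise_deltas(values):
    """Return values[i] - values[j] for every model pair i < j."""
    return [values[i] - values[j] for i, j in combinations(range(len(values)), 2)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy, made-up values for four models on a single target dataset.
TOY_SITE_SCORES = [2.0, 1.5, 1.4, 0.2]    # hypothetical metric scores T_m
TOY_ACCURACIES = [0.90, 0.89, 0.91, 0.60]  # hypothetical fine-tuned accuracies

if __name__ == "__main__":
    delta_t = pairwise_deltas(TOY_SITE_SCORES)
    delta_acc = pairwise_deltas(TOY_ACCURACIES)
    print(f"corr(delta_T, delta_Acc) = {pearson(delta_t, delta_acc):.2f}")
```

A correlation near 1 would mean that a larger score gap reliably signals a larger accuracy gap. A low value means the scores may rank models without indicating how much is gained by choosing one over another.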