Efficient Lifelong Model Evaluation in an Era of Rapid Progress

Ameya Prabhu, Vishaal Udandarao, Philip Torr, Matthias Bethge, Adel Bibi, Samuel Albanie

TL;DR

Sort & Search (S&S) is an efficient model-evaluation framework that reuses previously evaluated models, using dynamic programming to selectively rank and sub-select test samples, and achieves highly efficient approximate accuracy measurement.

Abstract

Standardized benchmarks drive progress in machine learning. However, with repeated testing, the risk of overfitting grows as algorithms over-exploit benchmark idiosyncrasies. In our work, we seek to mitigate this challenge by compiling ever-expanding large-scale benchmarks called Lifelong Benchmarks. These benchmarks introduce a major challenge: the high cost of evaluating a growing number of models across very large sample sets. To address this challenge, we introduce an efficient framework for model evaluation, Sort & Search (S&S), which reuses previously evaluated models by leveraging dynamic programming algorithms to selectively rank and sub-select test samples. To test our approach at scale, we create Lifelong-CIFAR10 and Lifelong-ImageNet, containing 1.69M and 1.98M test samples for classification. Extensive empirical evaluations across over 31,000 models demonstrate that S&S achieves highly efficient approximate accuracy measurement, reducing compute cost from 180 GPU days to 5 GPU hours (about 1000x reduction) on a single A100 GPU, with low approximation error and memory cost of <100MB. Our work also highlights issues with current accuracy prediction metrics, suggesting a need to move towards sample-level evaluation metrics. We hope to guide future research by showing our method's bottleneck lies primarily in generalizing Sort beyond a single rank order and not in improving Search.

Paper Structure (29 sections, 5 theorems, 8 equations, 9 figures, 1 table)

Key Result

Theorem 2.1

Given any ground-truth vector $\mathbf{a}_{m+1}$, it is possible to construct a prediction vector $\mathbf{y}_{m+1}$ such that $E_{\text{agg}}(\mathbf{y}_{m+1},\mathbf{a}_{m+1}) = 0$ and $E(\mathbf{a}_{m+1}, \mathbf{y}_{m+1}) = 2 \cdot \min(1 - |\mathbf{a}_{m+1}|/n, |\mathbf{a}_{m+1}|/n)$.
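
The construction behind this statement can be illustrated numerically. The sketch below is not taken from the paper: it assumes binary correctness vectors, that $E$ is the mean absolute sample-wise difference, and that $E_{\text{agg}}$ is the absolute difference in mean accuracy (neither definition appears in this excerpt). The function names are hypothetical.

```python
import numpy as np

def adversarial_prediction(a):
    """Build y with the same number of ones as a, placed to disagree with a
    on as many samples as possible (illustrative construction, an assumption)."""
    a = np.asarray(a, dtype=int)
    n, k = a.size, int(a.sum())
    zeros = np.flatnonzero(a == 0)   # samples the model actually gets wrong
    ones = np.flatnonzero(a == 1)    # samples the model actually gets right
    y = np.zeros(n, dtype=int)
    take_zero = min(k, zeros.size)   # put as many ones as possible where a is 0
    y[zeros[:take_zero]] = 1
    y[ones[:k - take_zero]] = 1      # any leftover ones must overlap with a
    return y

def aggregate_error(y, a):
    """Assumed E_agg: absolute difference between mean accuracies."""
    return abs(np.mean(y) - np.mean(a))

def samplewise_error(a, y):
    """Assumed E: mean absolute difference between per-sample outcomes."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(y))))

a = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])   # true accuracy 0.3
y = adversarial_prediction(a)
print(aggregate_error(y, a), samplewise_error(a, y))
```

Under these assumed definitions, the example has $|\mathbf{a}_{m+1}|/n = 0.3$, so the predicted accuracy matches exactly while the sample-wise error is $2 \cdot \min(0.7, 0.3) = 0.6$: an aggregate metric alone cannot tell this prediction apart from a perfect one.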

Figures (9)

  • Figure 1: Efficient Lifelong Model Evaluation. Assume an initial pool of $n$ samples and $m$ models evaluated on these samples (left). Our goal is to efficiently evaluate a new model ($\text{insert}_{\mathcal{M}}$) at sub-linear cost (right top) and efficiently insert a new sample into the lifelong benchmark ($\text{insert}_{\mathcal{D}}$) by determining sample difficulty at sub-linear cost (right bottom). See \ref{['sec:preliminaries']} for more details.
  • Figure 2: Full Pipeline of Sort & Search. To efficiently evaluate new models, (Left) we first sort all data samples by difficulty (see \ref{['sortsec']}) and (Right) then perform uniform sampling followed by DP-search and extrapolation to yield new model predictions (see \ref{['efficient-selection-by-search']}). This entire framework can also be transposed to efficiently insert new samples (see \ref{['efficient-insertion-section']}).
  • Figure 3: Main Results. (a, b) We achieve 99% cost-savings for new model evaluation on Lifelong-ImageNet and Lifelong-CIFAR10, showcasing the efficiency (MAE decays exponentially with $n'$) of Sort & Search. (c) S&S is more efficient and accurate compared to the baseline on Lifelong-ImageNet.
  • Figure 4: (a) We achieve accurate sample difficulty estimates on Lifelong-CIFAR10 ($<$0.15 MAE) at a fraction of the total number of models to be evaluated, thereby enabling cost-efficient sample insertion. (b, c, d) We analyse three design choices to better understand S&S, using Lifelong-ImageNet.
  • Figure 5: Error Decomposition Analysis on Lifelong-CIFAR10 (left) and Lifelong-ImageNet (right). We observe that epistemic error (solid line) drops to 0 within only 100 to 1000 samples across both datasets, indicating this error cannot be reduced further by better sampling methods. The total error $E$ is almost entirely irreducible (Aleatoric), induced because new models do not perfectly align with the ranking order $\mathbf{P}^*$. This suggests generalizing beyond a single rank ordering, not better sampling strategies, should be the focus of subsequent research efforts.
  • ...and 4 more figures
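
The pipeline described in the Figure 2 caption (sort samples by difficulty, sample uniformly, search for a threshold, extrapolate) can be sketched as follows. This is a simplified illustration, not the authors' implementation: ranking by the number of models that answer each sample correctly, the prefix-sum threshold search standing in for DP-search, the midpoint extrapolation rule, and all function names (including the `query_fn` callback that evaluates the new model on one sample) are assumptions.

```python
import numpy as np

def sort_by_difficulty(A):
    """A: (m models, n samples) binary correctness matrix.
    Returns sample indices ordered from easiest to hardest (assumed ranking:
    samples solved by more models are easier)."""
    A = np.asarray(A)
    return np.argsort(-A.sum(axis=0), kind="stable")

def search_threshold(observed):
    """observed: binary outcomes on sampled points, ordered easiest to hardest.
    Returns k minimizing mismatches against the pattern [1]*k + [0]*(len - k)."""
    observed = np.asarray(observed, dtype=int)
    # zeros_before[k]: wrong answers among the first k points (predicted correct)
    zeros_before = np.concatenate(([0], np.cumsum(1 - observed)))
    # ones_after[k]: correct answers among the remaining points (predicted wrong)
    ones_after = observed.sum() - np.concatenate(([0], np.cumsum(observed)))
    return int(np.argmin(zeros_before + ones_after))

def evaluate_new_model(order, query_fn, n_prime):
    """Evaluate a new model on about n_prime samples and extrapolate to all n.
    query_fn(sample_index) -> truthy if the new model is correct (hypothetical)."""
    order = np.asarray(order)
    n = order.size
    # Uniformly spaced positions in the sorted order
    positions = np.unique(np.linspace(0, n - 1, n_prime).round().astype(int))
    observed = np.array([int(query_fn(order[p])) for p in positions])
    k = search_threshold(observed)
    if k == 0:
        cut = 0                      # every sampled point was wrong
    elif k == positions.size:
        cut = n                      # every sampled point was correct
    else:
        # Place the cut midway between the last "correct" and first "wrong" point
        cut = (positions[k - 1] + positions[k] + 1) // 2
    pred = np.zeros(n, dtype=int)
    pred[order[:cut]] = 1            # predicted correct on the easiest `cut` samples
    return pred

# Toy usage: 100 samples already in difficulty order, new model solves the easiest 60.
order = np.arange(100)
pred = evaluate_new_model(order, lambda i: i < 60, n_prime=11)
print(pred.mean())
```

In this toy setting the new model agrees perfectly with the single rank order, so the only error comes from the coarse sampling grid. When a model's outcomes do not follow the ordering, no threshold fits exactly, which corresponds to the irreducible component discussed in the Figure 5 caption.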

Theorems & Definitions (8)

  • Theorem 2.1
  • Theorem 3.1
  • Theorem
  • proof
  • Proposition H.1
  • proof
  • Theorem
  • proof