The Limits of Assumption-free Tests for Algorithm Performance

Yuetian Luo, Rina Foygel Barber

TL;DR

This work analyzes the fundamental limits of evaluating and comparing machine-learning algorithms when data are limited and the distribution is unknown. It formalizes two parallel questions, EvaluateAlg (how well does an algorithm perform when trained on $n$ data points?) and EvaluateModel (how well does one particular fitted model perform?), and shows that, for any universally valid black-box test, inference about algorithm risk is hard unless the number of available data points $N$ is much larger than the training set size $n$, with a precise non-asymptotic bound on test power. Except in an extremely stable (consistency) regime, stability alone does not circumvent these hardness results, and analogous limits hold for algorithm comparison. The authors also demonstrate that simple data-splitting tests can achieve near-optimal power in certain cases, and they connect evaluation and comparison via a transformed evaluation problem. The findings clarify when and how distribution-free inference for algorithm performance is possible and point to promising directions, such as conformal risk control, for overcoming these fundamental barriers in practice.
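To make the idea of a simple data-splitting test concrete, here is a minimal sketch, not the authors' exact procedure. It assumes a 0-1 loss and tests whether the algorithm's risk at training size $n$ is below a threshold `tau`. It splits the $N$ available points into disjoint batches of size $n + 1$, trains on $n$ points in each batch, scores the remaining point, and applies an exact Binomial tail test. The names `data_splitting_test`, `binomial_cdf`, `algorithm`, and `loss` are hypothetical and chosen for illustration.

```python
import math

def binomial_cdf(s, k, p):
    """Exact P(Binomial(k, p) <= s)."""
    return sum(math.comb(k, j) * p**j * (1 - p)**(k - j) for j in range(s + 1))

def data_splitting_test(data, n, algorithm, loss, tau, alpha):
    """Return 1 to claim 'risk at training size n is below tau', else 0.

    Assumptions (illustrative): `data` is a list of (x, y) pairs,
    `algorithm(train)` returns a callable model, and `loss` is a 0-1 loss.
    """
    k = len(data) // (n + 1)          # number of disjoint batches
    errors = 0
    for b in range(k):
        batch = data[b * (n + 1):(b + 1) * (n + 1)]
        model = algorithm(batch[:n])  # treat the algorithm as a black box
        x, y = batch[n]               # one held-out point per batch
        errors += loss(model(x), y)
    # Under the null (risk >= tau), the error count stochastically dominates
    # a Binomial(k, tau) draw, so this lower-tail probability is a valid p-value.
    p_value = binomial_cdf(errors, k, tau)
    return int(p_value <= alpha)
```

Only about $N/(n+1)$ independent evaluations of the algorithm are available, which is why the ratio of $N$ to $n$ governs the power. For instance, with `tau = 0.5` and `alpha = 0.05`, even a run with zero errors cannot reject unless there are at least 5 batches, since $0.5^4 > 0.05$.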

Abstract

Algorithm evaluation and comparison are fundamental questions in machine learning and statistics: how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based around cross-validation type strategies, retraining the algorithm of interest on different subsets of the data and assessing its performance on the held-out data points. Despite the broad use of such procedures, the theoretical properties of these methods are not yet fully understood. In this work, we explore some fundamental limits for answering these questions with limited amounts of data. In particular, we make a distinction between two questions: how good is an algorithm A at the problem of learning from a training set of size $n$, versus, how good is a particular fitted model produced by running A on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm A as a "black box" (i.e., we can only study the behavior of A empirically), there is a fundamental limit on our ability to carry out inference on the performance of A, unless the number of available data points $N$ is many times larger than the evaluation sample size $n$ of interest. On the other hand, evaluating the performance of a particular fitted model can be easy when evaluating an algorithm is hard. We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that the same hardness result still holds for the problem of evaluating the performance of A, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we also establish similar hardness results for the problem of comparing multiple algorithms.
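The distinction between evaluating an algorithm and evaluating a fitted model can be illustrated with a toy calculation. The sketch below is an illustrative assumption, not an example from the paper: labels are drawn as $Y \sim \mathrm{Bernoulli}(p)$, the algorithm is a majority vote over $n$ training labels (with $n$ odd to avoid ties), and the loss is 0-1. The function names `model_risk` and `algorithm_risk` are hypothetical.

```python
import math

def model_risk(prediction, p):
    """0-1 risk of one fitted model: a constant classifier that always
    outputs `prediction`, when Y ~ Bernoulli(p)."""
    return 1 - p if prediction == 1 else p

def algorithm_risk(n, p):
    """Risk of the majority-vote algorithm at training size n (n odd),
    averaged over the random draw of the training set."""
    risk = 0.0
    for ones in range(n + 1):
        # Probability that the training set contains exactly `ones` labels equal to 1.
        prob = math.comb(n, ones) * p**ones * (1 - p)**(n - ones)
        prediction = 1 if 2 * ones > n else 0   # majority vote
        risk += prob * model_risk(prediction, p)
    return risk
```

With `n = 5` and `p = 0.6`, any particular fitted model has risk 0.4 or 0.6 depending on the training set it saw, while `algorithm_risk(5, 0.6)` is about 0.463. The second quantity is an average over training sets, so estimating it from data requires many independent training sets of size $n$, whereas the risk of a single fitted model can be estimated from held-out points alone.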

Paper Structure

This paper contains 52 sections, 10 theorems, 145 equations, 1 figure.

Key Result

Theorem 1

Assume that either $|\mathcal{X}| = \infty$ or $|\mathcal{Y}| = \infty$, and that the loss $\ell$ takes values in $[0,B]$. Let $\widehat{T}$ be a black-box test (as in Definition 1), and assume that $\widehat{T}$ satisfies the assumption-free validity condition (eqn:validity). Let [...], if we assume $\tilde{\tau} < R_P^{\max}$ so that the denominator is positive.

Figures (1)

  • Figure 1: An illustration of the phase transition for performing inference on $R_{P,n}(\mathcal{A})$, relative to the $\ell_1$- or $\ell_2$-stability of the algorithm. In the "consistency" regime, the questions EvaluateAlg and EvaluateModel are essentially equivalent (as discussed in Section \ref{sec:stability_or_consistency}), while in the "impossibility" regime, Theorem \ref{thm:limits_evaluate_stability} establishes fundamental limits for performing inference on the question EvaluateAlg. (Later on, in Theorem \ref{thm:limits_compare_stability}, we will see that a similar phase transition holds for the questions CompareAlg and CompareModel as well.)

Theorems & Definitions (19)

  • Definition 1: Black-box test for algorithm evaluation
  • Theorem 1
  • Remark 1
  • Remark 2: Validity for randomized versus deterministic algorithms
  • Remark 3: The infinite cardinality assumption
  • Proposition 1
  • Theorem 2
  • Theorem 3
  • Definition 2: Algorithmic stability
  • Proposition 2
  • ...and 9 more