The Limits of Assumption-free Tests for Algorithm Performance
Yuetian Luo, Rina Foygel Barber
TL;DR
This work analyzes the fundamental limits of evaluating and comparing machine-learning algorithms when data are limited and the distribution is unknown. It formalizes two parallel questions, EvaluateAlg (how well does an algorithm perform when trained on $n$ points?) and EvaluateModel (how well does one particular fitted model perform?), and shows that any universally valid black-box test of algorithm risk has limited power unless the number of available data points $N$ is much larger than the training sample size $n$; this limit is stated as a precise non-asymptotic bound on the power of the test. Except in an extremely stable (consistency) regime, where fitted models are essentially nonrandom, assuming algorithmic stability does not circumvent these hardness results, and analogous limits hold for algorithm comparison. The authors also show that simple data-splitting tests can achieve near-optimal power in certain cases, and they connect evaluation and comparison through a transformed evaluation problem. The findings clarify when distribution-free inference on algorithm performance is possible and point to promising directions, such as conformal risk control, for working around these barriers in practice.
Abstract
Algorithm evaluation and comparison are fundamental questions in machine learning and statistics: how well does an algorithm perform at a given modeling task, and which algorithm performs best? Many methods have been developed to assess algorithm performance, often based on cross-validation-type strategies that retrain the algorithm of interest on different subsets of the data and assess its performance on the held-out data points. Despite the broad use of such procedures, their theoretical properties are not yet fully understood. In this work, we explore some fundamental limits on answering these questions with limited amounts of data. In particular, we distinguish between two questions: how good is an algorithm A at the problem of learning from a training set of size $n$, versus how good is a particular fitted model produced by running A on a particular training data set of size $n$? Our main results prove that, for any test that treats the algorithm A as a ``black box'' (i.e., we can only study the behavior of A empirically), there is a fundamental limit on our ability to carry out inference on the performance of A, unless the number of available data points $N$ is many times larger than the training sample size $n$ of interest. In contrast, evaluating the performance of a particular fitted model can be easy even when evaluating the algorithm is hard. We also ask whether an assumption of algorithmic stability might be sufficient to circumvent this hardness result. Surprisingly, we find that the same hardness result still holds for the problem of evaluating the performance of A, aside from a high-stability regime where fitted models are essentially nonrandom. Finally, we establish similar hardness results for the problem of comparing multiple algorithms.
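
To make the distinction between the two questions concrete, the following is a minimal Python sketch and not the authors' procedure. It assumes a loss bounded in [0, 1] and uses a Hoeffding upper confidence bound; the function names (`evaluate_model`, `evaluate_alg`, `majority_vote`) and the toy data are hypothetical. The fitted model is assessed on every held-out point, whereas the black-box assessment of the algorithm must retrain on disjoint batches of size $n + 1$ and therefore obtains only about $N/(n+1)$ independent losses.

```python
import math
import random


def hoeffding_upper_bound(losses, alpha):
    """One-sided (1 - alpha) upper confidence bound on the mean of i.i.d. losses in [0, 1]."""
    m = len(losses)
    return sum(losses) / m + math.sqrt(math.log(1.0 / alpha) / (2.0 * m))


def evaluate_model(model, holdout, loss, alpha=0.05):
    """Bound the risk of ONE fitted model using held-out points it never saw."""
    return hoeffding_upper_bound([loss(model(x), y) for x, y in holdout], alpha)


def evaluate_alg(algorithm, data, n, loss, alpha=0.05):
    """Bound the expected risk of the ALGORITHM at training size n.

    Splits the data into disjoint batches of size n + 1: train on n points,
    test on the remaining one. Each batch yields a single independent loss,
    so only about N / (n + 1) losses are available.
    """
    num_batches = len(data) // (n + 1)
    if num_batches == 0:
        raise ValueError("need at least n + 1 data points")
    losses = []
    for b in range(num_batches):
        batch = data[b * (n + 1):(b + 1) * (n + 1)]
        fitted = algorithm(batch[:n])      # retrain from scratch on n points
        x, y = batch[n]                    # one fresh test point per batch
        losses.append(loss(fitted(x), y))
    return hoeffding_upper_bound(losses, alpha), num_batches


# Hypothetical toy setup: a majority-vote classifier on noisy binary labels.
def majority_vote(train):
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


def zero_one_loss(pred, y):
    return float(pred != y)


rng = random.Random(0)
data = [(rng.random(), int(rng.random() < 0.7)) for _ in range(2000)]
n = 100

model = majority_vote(data[:n])
model_bound = evaluate_model(model, data[n:], zero_one_loss)
alg_bound, num_batches = evaluate_alg(majority_vote, data, n, zero_one_loss)
print(f"fitted-model bound from {len(data) - n} held-out points: {model_bound:.3f}")
print(f"algorithm bound from {num_batches} batches: {alg_bound:.3f}")
```

With $N = 2000$ and $n = 100$ in this toy setup, the algorithm-level bound rests on 19 losses while the fitted-model bound rests on 1900, so the Hoeffding slack term is roughly 0.28 versus 0.03. This illustrates, under the stated assumptions, why the effective sample size for evaluating the algorithm scales like $N/n$ rather than $N$.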
