A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation
Riccardo Fogliato, Pratik Patil, Mathew Monfort, Pietro Perona
TL;DR
The paper addresses the cost of estimating the accuracy of computer vision (CV) models when labeled data are limited. It introduces a modular statistical framework that combines stratification, sampling, and estimation. Stratification guided by accurate predictions of model performance, particularly a $k$-means partition on $\mathbb{E}_P[Z|X]$, yields substantial efficiency gains over simple random sampling (up to around 10x in some cases), and model-assisted estimators that use unlabeled data reduce variance further. The authors provide theoretical results linking optimal stratification and allocation to established survey-sampling criteria, and validate the approach through extensive CV experiments using CLIP and surrogate models. Practical recommendations include using stratified simple random sampling (SSRS) with proportional allocation and calibrated proxies, with Neyman allocation as a potential further improvement when the proxy's calibration is reliable. They also discuss limitations such as distribution shift and the need for calibration, and suggest directions for deployment, including calibration, sequential sampling, and out-of-distribution (OOD) considerations. The work thus provides a principled, actionable path to efficient model evaluation in CV, enabling more precise comparisons with far fewer annotated test examples.
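To make the stratification idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a synthetic population in which a proxy score equals the true probability that the model is correct (a perfectly calibrated proxy, which real proxies will only approximate). The helper names `kmeans_1d`, `proportional_allocation`, and `stratified_estimate`, the Beta-distributed proxy, and the budget of 200 labels are all hypothetical choices made for illustration. The sketch partitions the pool with a one-dimensional $k$-means on the proxy, samples within strata under proportional allocation, and compares the variance of the resulting estimate with that of simple random sampling.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=50):
    """Lloyd's algorithm on scalars; returns a stratum label for each item."""
    # Deterministic start: centres at evenly spaced quantiles of the proxy.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for h in range(k):
            members = values[labels == h]
            if members.size:  # leave the centre of an empty cluster unchanged
                centers[h] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    # Relabel to 0..H-1 so that every remaining stratum is non-empty.
    return np.unique(labels, return_inverse=True)[1]

def proportional_allocation(labels, n):
    """Give each stratum a share of the budget n proportional to its size."""
    sizes = np.bincount(labels)
    alloc = np.round(n * sizes / sizes.sum()).astype(int)
    # At least one label per stratum, never more than the stratum holds.
    # Rounding means the total can differ from n by a few labels.
    return np.clip(alloc, 1, sizes)

def stratified_estimate(z, strata, alloc, rng):
    """Weighted sum of per-stratum sample means, with weights N_h / N."""
    total = sum(len(idx) for idx in strata)
    est = 0.0
    for idx, n_h in zip(strata, alloc):
        sample = rng.choice(idx, size=n_h, replace=False)
        est += len(idx) / total * z[sample].mean()
    return est

# Synthetic population (assumption): proxy is a calibrated estimate of E[Z|X].
rng = np.random.default_rng(0)
N, n, k, reps = 20_000, 200, 5, 500
proxy = rng.beta(0.5, 0.5, size=N)
z = (rng.random(N) < proxy).astype(float)  # 1 if the model is correct on item i

labels = kmeans_1d(proxy, k)
strata = [np.flatnonzero(labels == h) for h in range(labels.max() + 1)]
alloc = proportional_allocation(labels, n)

# Repeat both designs many times to compare the spread of the estimates.
srs_ests = np.array([z[rng.choice(N, size=n, replace=False)].mean()
                     for _ in range(reps)])
strat_ests = np.array([stratified_estimate(z, strata, alloc, rng)
                       for _ in range(reps)])
var_srs, var_strat = srs_ests.var(), strat_ests.var()

print(f"true accuracy {z.mean():.3f}")
print(f"SRS variance {var_srs:.2e}, stratified variance {var_strat:.2e}")
```

The size of the gain in this toy setting depends entirely on how well the proxy separates easy from hard examples; it should not be read as reproducing the paper's reported numbers.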
Abstract
Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy from a single, completely random sample of the data. However, tailored sampling and estimation strategies can yield more precise estimates and reduce annotation costs. In this paper, we propose a statistical framework for model evaluation that includes stratification, sampling, and estimation components. We examine the statistical properties of each component and evaluate their efficiency (precision). One key result of our work is that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators. Our experiments on computer vision datasets show that this method consistently provides more precise accuracy estimates than traditional simple random sampling, with efficiency gains of up to 10x. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than traditional estimates based solely on the labeled data.
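As an illustration of what a model-assisted estimator can look like, here is a minimal sketch of the difference estimator, a standard form from survey sampling. Whether it matches the exact estimator studied in the paper is an assumption. The function name `difference_estimate`, the synthetic population, and the deliberately miscalibrated proxy `0.1 + 0.8 * p` are hypothetical and chosen only for illustration. The estimator averages the proxy over the whole pool, labeled or not, and then corrects it with the mean residual observed on the labeled sample, so it remains unbiased under simple random sampling even when the proxy is miscalibrated.

```python
import numpy as np

def difference_estimate(z_sample, proxy_sample, proxy_all):
    """Mean proxy over the whole pool plus the mean residual on the labeled sample."""
    return proxy_all.mean() + (z_sample - proxy_sample).mean()

# Synthetic population (assumption): p is the true chance the model is correct.
rng = np.random.default_rng(1)
N, n, reps = 20_000, 200, 500
p = rng.beta(0.5, 0.5, size=N)
z = (rng.random(N) < p).astype(float)  # 1 if the model is correct on item i
proxy = 0.1 + 0.8 * p                  # informative but miscalibrated proxy

plain, assisted = [], []
for _ in range(reps):
    s = rng.choice(N, size=n, replace=False)  # simple random sample of labels
    plain.append(z[s].mean())                 # estimate from labeled data only
    assisted.append(difference_estimate(z[s], proxy[s], proxy))

plain, assisted = np.array(plain), np.array(assisted)
var_plain, var_assisted = plain.var(), assisted.var()

print(f"true accuracy {z.mean():.3f}")
print(f"labeled-only variance {var_plain:.2e}, model-assisted variance {var_assisted:.2e}")
```

The variance reduction comes from the residuals `z - proxy` varying less than `z` itself, so a proxy that is uninformative about correctness would give little or no gain.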
