A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation

Riccardo Fogliato; Pratik Patil; Mathew Monfort; Pietro Perona

TL;DR

The paper addresses the cost of estimating the accuracy of computer vision (CV) models when labeled data are limited. It introduces a modular statistical framework that combines stratification, sampling, and estimation. It shows that stratification guided by accurate predictions of model performance, particularly via a $k$-means partition on $\mathbb{E}_P[Z \mid X]$, yields substantial efficiency gains (up to around 10x in some cases) over simple random sampling, and that model-assisted estimators using unlabeled data further reduce variance. The authors provide theoretical results linking optimal stratification and allocation to established survey-sampling criteria, and validate the approach through extensive CV experiments using CLIP and surrogate models. Their practical recommendations include using stratified simple random sampling (SSRS) with proportional allocation and calibrated proxies, with Neyman allocation as a potential further improvement when the proxy is well calibrated. They also discuss limitations such as distribution shift and the need for calibration, and suggest directions for deployment, including calibration, sequential sampling, and out-of-distribution (OOD) considerations. The work thus provides a principled, actionable pathway to efficient model evaluation in CV, enabling more precise comparisons with far fewer annotated test examples.
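To make the idea of a model-assisted estimator concrete, the sketch below compares the plain sample mean under simple random sampling with a difference estimator that adds the mean proxy prediction over all items to the mean residual on the labeled sample. It is a minimal illustration, not the authors' implementation: the synthetic population, the Beta-distributed proxy `zhat_all` (calibrated by construction), and all function names are assumptions introduced here.

```python
import random

def ht_estimate(z_sample):
    """Plain mean of the labeled correctness indicators (the naive estimate under SRS)."""
    return sum(z_sample) / len(z_sample)

def difference_estimate(zhat_all, sample_idx, z_sample):
    """Mean proxy over all N items plus the mean residual (Z - Zhat) on the labeled sample."""
    proxy_mean = sum(zhat_all) / len(zhat_all)
    residual_mean = sum(z - zhat_all[i] for i, z in zip(sample_idx, z_sample)) / len(sample_idx)
    return proxy_mean + residual_mean

# Synthetic population (assumption): zhat is the true probability that the model is correct.
rng = random.Random(0)
N, n, reps = 5000, 100, 1000
zhat_all = [rng.betavariate(0.5, 0.5) for _ in range(N)]
z_all = [1 if rng.random() < p else 0 for p in zhat_all]
true_acc = sum(z_all) / N

# Monte Carlo estimate of the MSE of both estimators over repeated simple random samples.
sq_err_ht = sq_err_df = 0.0
for _ in range(reps):
    idx = rng.sample(range(N), n)          # SRS without replacement
    z_sample = [z_all[i] for i in idx]     # "annotating" the sampled items
    sq_err_ht += (ht_estimate(z_sample) - true_acc) ** 2
    sq_err_df += (difference_estimate(zhat_all, idx, z_sample) - true_acc) ** 2
mse_ht, mse_df = sq_err_ht / reps, sq_err_df / reps
print(f"MSE HT: {mse_ht:.5f}  MSE DF: {mse_df:.5f}  ratio: {mse_df / mse_ht:.2f}")
```

In this toy setting the gain comes entirely from the proxy being informative and calibrated; with a poorly calibrated proxy the residuals can have larger variance than the labels themselves, which is consistent with the calibration caveat above.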

Abstract

Model performance evaluation is a critical and expensive task in machine learning and computer vision. Without clear guidelines, practitioners often estimate model accuracy using a one-time completely random selection of the data. However, by employing tailored sampling and estimation strategies, one can obtain more precise estimates and reduce annotation costs. In this paper, we propose a statistical framework for model evaluation that includes stratification, sampling, and estimation components. We examine the statistical properties of each component and evaluate their efficiency (precision). One key result of our work is that stratification via $k$-means clustering based on accurate predictions of model performance yields efficient estimators. Our experiments on computer vision datasets show that this method consistently provides more precise accuracy estimates than traditional simple random sampling, with efficiency gains of up to 10x. We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than traditional estimates based solely on the labeled data.
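The pipeline described in the abstract (predict, stratify, sample, estimate) can be sketched end to end on synthetic data. The code below is an illustrative sketch under stated assumptions rather than the paper's code: it uses a hand-written one-dimensional Lloyd's $k$-means on a proxy `zhat`, proportional allocation with simple rounding, and a stratified mean. The synthetic population, the choice of five strata, and every function name are hypothetical.

```python
import random

def kmeans_1d(values, k, iters=25):
    """Lloyd's algorithm on scalars; returns one stratum label per value."""
    srt = sorted(values)
    centers = [srt[int((j + 0.5) * len(srt) / k)] for j in range(k)]  # quantile init
    def assign(v):
        return min(range(k), key=lambda j: (v - centers[j]) ** 2)
    for _ in range(iters):
        labels = [assign(v) for v in values]
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return [assign(v) for v in values]

def proportional_sizes(strata, n):
    """Per-stratum sample sizes proportional to stratum size (rounded, at least 1)."""
    N = sum(len(s) for s in strata)
    return [min(len(s), max(1, round(n * len(s) / N))) for s in strata]

def stratified_estimate(strata, sizes, label_fn, rng):
    """Weighted mean of per-stratum sample means, with SRS inside each stratum."""
    N = sum(len(s) for s in strata)
    est = 0.0
    for s, n_h in zip(strata, sizes):
        drawn = rng.sample(s, n_h)
        est += len(s) / N * sum(label_fn(i) for i in drawn) / n_h
    return est

# Synthetic population (assumption): zhat is a calibrated prediction of correctness.
rng = random.Random(1)
N, n, k, reps = 3000, 100, 5, 1000
zhat = [rng.betavariate(0.5, 0.5) for _ in range(N)]
z = [1 if rng.random() < p else 0 for p in zhat]
true_acc = sum(z) / N

labels = kmeans_1d(zhat, k)
strata = [[i for i in range(N) if labels[i] == j] for j in range(k)]
strata = [s for s in strata if s]            # drop empty clusters, if any
sizes = proportional_sizes(strata, n)
n_total = sum(sizes)                          # rounding may shift the budget slightly

se_srs = se_str = 0.0
for _ in range(reps):
    srs = rng.sample(range(N), n_total)       # baseline: SRS with the same budget
    se_srs += (sum(z[i] for i in srs) / n_total - true_acc) ** 2
    se_str += (stratified_estimate(strata, sizes, lambda i: z[i], rng) - true_acc) ** 2
mse_srs, mse_str = se_srs / reps, se_str / reps
print(f"MSE SRS: {mse_srs:.5f}  MSE stratified: {mse_str:.5f}")
```

The baseline uses the same total budget `n_total` as the stratified design so that the comparison is not affected by rounding in the allocation. How large the gain is depends on how well the proxy separates correct from incorrect predictions.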

Paper Structure (39 sections, 4 theorems, 24 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Contributions and outline.
Related Work
Related Work in Survey Sampling
Related Work in Machine Learning
Framework Overview
Formal Setup
Framework Overview
Prediction (of $Z$).
Stratification.
Sampling.
Estimation.
Design of Framework Components
Choosing the Sampling Design
Designing the Strata
...and 24 more sections

Key Result

Proposition 1

Under the setup of sec:setup,

Figures (7)

Figure 1: Mean squared errors (MSEs) of estimators across sampling designs. Estimates of zero-shot accuracy of ViT-B/32 in classification tasks on three datasets as a function of the amount of labeled data. Stratified sampling can dramatically reduce the number of annotations needed to accurately estimate the model accuracy compared to the naive average ($\mathtt{HT}$) under simple random sampling. Neyman allocation can sometimes further improve precision compared to proportional allocation. (From left to right) No savings on the Dmlab Frames dataset, about 5x savings on Stanford Cars, and about 10x savings on CIFAR-10. Note that the efficiency (precision) gains vary considerably between datasets (see the results section for analysis and discussion). When stratified sampling with $k$-means on model predictions is not used, the difference estimator can still greatly improve precision.
Figure 2: Comparison of efficiency across stratification procedures, sampling designs, and estimators. The violin plots illustrate the relative efficiency of the Horvitz-Thompson ($\mathtt{HT}$) estimator under simple random sampling ($\mathtt{SRS}$, red dashed line) compared to other survey sampling strategies and estimators (relative efficiency is $\textrm{MSE}_\pi(\widehat{\theta}_{\text{\tiny{EST}}})/\textrm{MSE}_{\mathtt{SRS}}(\widehat{\theta}_{\mathtt{HT}})$) for estimating the accuracy of CLIP ViT-B/32 on classification tasks in the benchmark. Lower values indicate larger efficiency gains compared to the baseline. The dots and lines represent the relative efficiencies of the sampling methods and estimators on the various tasks.
Figure 3: Characterization of efficiency gains. The left panel shows the mean squared error ($\mathrm{MSE}$) of the difference estimator ($\mathtt{DF}$) under simple random sampling ($\mathtt{SRS}$, corrected by $n/(1-f)$) as a function of the zero-shot classification accuracy $N^{-1}\sum_{i\in \mathcal{D}}Z_i$ of CLIP ViT-B/32 evaluated on the full test sets of the LAION CLIP benchmark tasks. We construct $\widehat{Z}$ using the predictions of CLIP with ViT-B/32 as the backbone. Dashed lines correspond to the relative efficiencies of $1$ (highest line), $0.75$, and $0.5$ (lowest). In tasks where the model achieves higher classification accuracy, it also tends to have higher relative efficiency. The right panel shows the allocation of the annotation budget to each stratum through proportional and optimal (ideal based on $S_{Z_h}$ and actual based on $\widehat{S}_{\widehat{Z}_h}$) allocations across three datasets. In practice, Neyman allocation provides efficiency gains over proportional allocation only on Stanford Cars.
Figure 4: Comparison of efficiency across stratification procedures, sampling designs, and estimators for estimating $\mathrm{MSE}$ and cross-entropy. We evaluate the zero-shot accuracy of CLIP ViT-B/32 and generate surrogate predictions using CLIP ViT-L/14, also in the zero-shot setting. For more details, refer to Figure 2.
Figure 5: Comparison of efficiency across sampling designs, estimators, and CLIP models in the zero-shot setting (ZS) and with linear probing (LP). In this figure, we present the results specifically for the proxy $\widehat{Z}$ of $Z$ built on the model being evaluated. For a more detailed explanation of the figure, please see Figure 2.
...and 2 more figures
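The captions above contrast proportional and Neyman allocation of the annotation budget. As a hedged illustration using the standard survey-sampling formulas (proportional: $n_h \propto N_h$; Neyman: $n_h \propto N_h S_h$), the sketch below computes both allocations and the resulting variance of the stratified mean. The stratum sizes and standard deviations are made-up numbers, allocations are left unrounded for clarity, and the function names are assumptions. In practice $S_h$ is unknown and must be estimated from a proxy, which is one reason Neyman allocation does not always help.

```python
def proportional_allocation(sizes, n):
    """n_h proportional to the stratum size N_h."""
    total = sum(sizes)
    return [n * N_h / total for N_h in sizes]

def neyman_allocation(sizes, sds, n):
    """n_h proportional to N_h * S_h, where S_h is the within-stratum standard deviation."""
    weights = [N_h * S_h for N_h, S_h in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

def stratified_variance(sizes, sds, alloc):
    """Variance of the stratified mean under SRS without replacement within each stratum."""
    N = sum(sizes)
    return sum(
        (N_h / N) ** 2 * (1 - n_h / N_h) * S_h ** 2 / n_h
        for N_h, S_h, n_h in zip(sizes, sds, alloc)
    )

# Hypothetical strata: a large easy stratum and a small, highly variable one.
sizes = [6000, 3000, 1000]
sds = [0.1, 0.3, 0.5]
n = 200

prop = proportional_allocation(sizes, n)
ney = neyman_allocation(sizes, sds, n)
var_prop = stratified_variance(sizes, sds, prop)
var_ney = stratified_variance(sizes, sds, ney)
print("proportional:", prop, "variance:", var_prop)
print("Neyman:      ", ney, "variance:", var_ney)
```

With the true $S_h$ plugged in, Neyman allocation shifts budget toward the more variable strata and lowers the variance in this example. If the standard deviations are misestimated, the same formula can move budget in the wrong direction.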

Theorems & Definitions (4)

Proposition 1
Proposition 2
Corollary 3
Proposition 4
