How Many Ratings per Item are Necessary for Reliable Significance Testing?
Christopher Homan, Flip Korn, Deepak Pandita, Chris Welty
TL;DR
The paper asks how many ratings per item are needed for reliable significance testing when model and human responses vary. It introduces a two-stage probabilistic response model and a multistage bootstrap simulator that estimate the number of items $N$, the number of responses per item $K$, and the perturbation $\epsilon$ required for null hypothesis statistical testing (NHST) that compares two models against gold data $G$ under a metric $\Gamma$, along with the resulting power $1-\beta$. Extending prior work, it demonstrates that common budgets (e.g., 5–10 responses per item) are often underpowered and shows, across seven real datasets, how reallocating budget toward more responses per item or more items can improve reproducibility. The results give practical guidance for budgeting annotations in AI evaluation and indicate that the best $N$–$K$ trade-off depends on the metric and the dataset. Overall, the approach supports data-driven planning of benchmark data collection and statistical testing in the presence of response variance.
Abstract
A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, ``gold standard'' data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI -- along with strong evidence that humans are unreliable judges -- estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
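To make the question in the abstract concrete, the following is a minimal simulation sketch of how power can depend on the number of items and the number of responses per item. It is not the authors' simulator: the Beta(2, 2) item prior, the binary responses, the mean-absolute-error metric, the sign-flip permutation test (used here in place of the paper's multistage bootstrap), and every function and parameter name are assumptions chosen only for illustration.

```python
import numpy as np


def simulate_errors(rng, n_items, k, eps):
    """Draw per-item errors for two hypothetical models against gold data."""
    # Stage 1 (assumed): each item has its own latent rate of positive responses.
    p_gold = rng.beta(2.0, 2.0, size=n_items)
    # Model A shares the gold rates; model B is perturbed by eps.
    p_b = np.clip(p_gold + eps, 0.0, 1.0)
    # Stage 2 (assumed): k binary responses per item from each source,
    # summarised as the fraction of positive responses.
    gold = rng.binomial(k, p_gold, size=n_items) / k
    resp_a = rng.binomial(k, p_gold, size=n_items) / k
    resp_b = rng.binomial(k, p_b, size=n_items) / k
    # Metric (assumed): per-item absolute error against the gold fraction.
    return np.abs(resp_a - gold), np.abs(resp_b - gold)


def sign_flip_pvalue(rng, diff, n_resamples):
    """Two-sided paired permutation test on per-item score differences."""
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (1 + np.sum(null >= observed)) / (1 + n_resamples)


def estimate_power(n_items, k, eps, alpha=0.05,
                   n_trials=200, n_resamples=500, seed=0):
    """Fraction of simulated experiments in which the test rejects the null."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        err_a, err_b = simulate_errors(rng, n_items, k, eps)
        p_value = sign_flip_pvalue(rng, err_b - err_a, n_resamples)
        rejections += int(p_value < alpha)
    return rejections / n_trials


if __name__ == "__main__":
    # Sweep responses per item at a fixed number of items and perturbation.
    for k in (3, 10, 30):
        print(k, estimate_power(n_items=100, k=k, eps=0.1))
```

Under these assumptions, setting `eps=0` gives an estimate of the false-positive rate, and sweeping `n_items` and `k` subject to a fixed total `n_items * k` is one way to explore how a given annotation budget might be allocated.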
