How Many Ratings per Item are Necessary for Reliable Significance Testing?

Christopher Homan, Flip Korn, Deepak Pandita, Chris Welty

TL;DR

The paper addresses how many ratings per item are needed to reliably test significance when responses vary across items. It introduces a two-stage probabilistic response model plus a multistage bootstrap simulator to estimate the required number of items $N$, responses per item $K$, and perturbation $\epsilon$ for NHST comparing two models against gold $G$ under metric $\Gamma$, while also providing power estimates $1-\beta$. Extending prior work, it demonstrates that common budgets (e.g., 5–10 responses per item) are often underpowered and shows how reallocating budget toward more responses per item or more items can improve reproducibility across seven real datasets. The results offer practical guidance for budgeting annotations in AI evaluation and highlight that the optimal $N$–$K$ trade-off is metric- and dataset-dependent. Overall, the approach enables robust, data-driven planning of benchmark data collection and statistical testing in the presence of response variance.

Abstract

A cornerstone of machine learning evaluation is the (often hidden) assumption that model and human responses are reliable enough to evaluate models against unitary, authoritative, "gold standard" data, via simple metrics such as accuracy, precision, and recall. The generative AI revolution would seem to explode this assumption, given the critical role stochastic inference plays. Yet, in spite of public demand for more transparency in AI, along with strong evidence that humans are unreliable judges, estimates of model reliability are conventionally based on, at most, a few output responses per input item. We adapt a method, previously used to evaluate the reliability of various metrics and estimators for machine learning evaluation, to determine whether an (existing or planned) dataset has enough responses per item to assure reliable null hypothesis statistical testing. We show that, for many common metrics, collecting even 5-10 responses per item (from each model and team of human evaluators) is not sufficient. We apply our methods to several of the very few extant gold standard test sets with multiple disaggregated responses per item and show that even these datasets lack enough responses per item. We show how our methods can help AI researchers make better decisions about how to collect data for AI evaluation.
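To make the summarized idea concrete, here is a minimal Python sketch of a two-stage response model combined with a multistage bootstrap. It is an illustration under stated assumptions, not the authors' simulator: the item-level distributions, the treatment of the perturbation $\epsilon$ as an additive shift on the second model's item means, the MAE-style metric, and every function name (`simulate`, `mae`, `resample`, `bootstrap_p_value`, `estimate_power`) are hypothetical choices made for this example.

```python
import numpy as np


def simulate(n_items, k, eps, rng):
    """Two-stage draw: item-level parameters first, then K responses per item."""
    # Stage 1 (assumed distributions): a mean and a standard deviation per item.
    mu = np.clip(np.abs(rng.normal(0.2, 0.1, n_items)), 0.0, 1.0)
    sd = rng.uniform(0.05, 0.3, n_items)
    # Stage 2: K responses per item for gold, model A, and model B.
    # Assumption: model B's item means are shifted by eps, so B is worse than A.
    shape = (n_items, k)
    gold = rng.normal(mu[:, None], sd[:, None], shape)
    model_a = rng.normal(mu[:, None], sd[:, None], shape)
    model_b = rng.normal(mu[:, None] + eps, sd[:, None], shape)
    return gold, model_a, model_b


def mae(model, gold):
    """Mean absolute error between item-level mean responses."""
    return np.mean(np.abs(model.mean(axis=1) - gold.mean(axis=1)))


def resample(x, items, rng):
    """Take the resampled items, then resample K responses within each item."""
    n, k = len(items), x.shape[1]
    cols = rng.integers(0, k, size=(n, k))
    return x[items[:, None], cols]


def bootstrap_p_value(gold, a, b, n_boot, rng):
    """One-sided bootstrap estimate: share of resamples where B is not worse than A."""
    n_items = gold.shape[0]
    not_worse = 0
    for _ in range(n_boot):
        items = rng.integers(0, n_items, size=n_items)  # resample items
        g, ra, rb = (resample(x, items, rng) for x in (gold, a, b))
        if mae(rb, g) - mae(ra, g) <= 0:
            not_worse += 1
    return not_worse / n_boot


def estimate_power(n_items, k, eps, alpha, n_trials, n_boot, seed=0):
    """Fraction of simulated datasets in which the test rejects at level alpha."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_trials):
        gold, a, b = simulate(n_items, k, eps, rng)
        rejections += int(bootstrap_p_value(gold, a, b, n_boot, rng) < alpha)
    return rejections / n_trials
```

Calling, for example, `estimate_power(n_items=100, k=5, eps=0.1, alpha=0.05, n_trials=50, n_boot=200)` and then repeating with a different split of the same $N \times K$ budget gives a rough feel for how the trade-off between items and responses per item affects power under these assumed distributions.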

Paper Structure

This paper contains 14 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Empirical CDFs of item-level response means and standard deviations in (a) the Stanford Toxicity dataset vs clipped, folded normal CDF with $\langle \mu=0.19, \sigma=0.11 \rangle$ and clipped triangular distribution CDF with $\langle a=-0.05, b=0.21, c=0.45 \rangle$, respectively; and (b) the MultiDomain-Agreement dataset vs truncated normal CDF with $\langle \mu=-0.5, \sigma=1 \rangle$ and truncated normal CDF with $\langle \mu=-0.3923, \sigma=0.8502 \rangle$, respectively.
  • Figure 2: p-value vs $K$ with $\Gamma_{\rm MAE}$ at various $N \times K$. Each data point is estimated from 10,000 samples.
  • Figure 3: p-value vs $K$ with $\Gamma_{\rm MAE}$ at various $N \times K$ for Toxicity, with a log-scale y-axis. Each data point is estimated from 10,000 samples.
  • Figure 4: p-value vs $K$ with a fixed budget $N \times K = 2500$ for various metrics. Each data point is estimated from 10,000 samples.
  • Figure 5: Power analysis of Toxicity data ($\epsilon=0.1$). Each data point is estimated from 1,000 outer-level samples, each consisting of 10,000 inner-level samples.
  • ...and 6 more figures