Identifying Statistical Bias in Dataset Replication

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, Aleksander Madry

TL;DR

Dataset replication can diagnose whether benchmark progress generalizes beyond test sets. The authors identify a statistic-matching bias, where noisy selection-frequency readings used to construct ImageNet-v2 skew the replication, and develop both nonparametric (jackknife) and parametric (beta-binomial mixtures with splines) methods to debias the observed accuracy gap. Their debiased analysis reduces the v1-to-v2 accuracy drop from $11.7\% \pm 1.0\%$ to $3.6\% \pm 1.5\%$, indicating that much of the apparent drop is attributable to bias in the replication pipeline rather than genuine generalization failure. The work provides practical recommendations for recognizing and avoiding bias in dataset replication, with implications for distribution-shift research and adaptive overfitting across domains.

Abstract

Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy, even after controlling for a standard human-in-the-loop measure of data quality. We show that after correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for. We conclude with concrete recommendations for recognizing and avoiding bias in dataset replication. Code for our study is publicly available at http://github.com/MadryLab/dataset-replication-analysis .

Paper Structure

This paper contains 65 sections, 29 equations, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: The smallest, median, and largest selection frequency images from v1 corresponding to the "throne" class (description: the chair of state for a monarch, bishop, etc.; "the king sat on his throne"; the "throne" class was randomly chosen). The images become easier to identify as the labeled class as selection frequency increases; for additional context, we give a random sampling of selection frequency/image pairs in Appendix \ref{app:exp_setup}.
  • Figure 2: For an image $x$, the selection frequency statistic $s(x)$ described in Section \ref{sec:id_bias} is a single number in $[0, 1]$ that captures how "easy" a given image is to classify for humans. A distribution over images ($p_i(x)$) thus induces a one-dimensional distribution over selection frequencies ($p_i(s(x))$). In (a), we visualize such hypothetical selection frequency distributions for both the Flickr data distribution ($p_{flickr}(s(x))$) and the ImageNet-v1 data distribution ($p_1(s(x))$). In (b), we consider a case where we are given, for a specific image $x$, a noisy version of $s(x)$ ($\hat{s}(x)$). We visualize the corresponding distribution of the true selection frequency $s(x)$ given this noisy $\hat{s}(x) = 0.7$. As discussed in Section \ref{sec:id_bias}, note that even though $\hat{s}(x)$ is an unbiased estimate of $s(x)$, the most likely value of $s(x)$ for a given noisy reading of $\hat{s}(x)$ actually depends on the distribution from which $x$ is drawn. This is the driving phenomenon behind the observed bias between ImageNet and ImageNet-v2.
  • Figure 3: Illustrations accompanying the simple theoretical model. (a) In the simple model, we assume $p_1(s(x))$ and $p_{flickr}(s(x))$ are $\text{Beta}(\alpha+1,\beta)$ and $\text{Beta}(\alpha,\beta)$, respectively---this is visualized above for the case of $\alpha=\beta=2$. (b) The results of the simple model reveal that as more and more samples are used to estimate $s(x)$ for each image, the resulting ImageNet-v2 distribution tends towards the v1 distribution, but does not actually match the v1 sample for any finite number of samples per image.
  • Figure 4: Selection frequency histograms for v1 and v2 based on our selection frequency re-measurement experiment. Results indicate that v2 seems to have lower selection frequency.
  • Figure 5: A series of graphs, all demonstrating bias in estimators that condition on selection frequency. Left: The estimated population density of selection frequencies, calculated naïvely from samples. For a given number of annotators per image $n$, the corresponding line in the graph has equally spaced points of the form $(k/n, \sum \bm{1}_{\hat{s}=k/n})$. Middle: Model accuracy of a ResNet-26 conditioned on selection frequency; once again, we naïvely use empirical selection frequency in place of true selection frequency for conditioning. Just as in the left-most graph, for a given $n$-annotator line, points at $x=k/n$ in the graph correspond to the accuracy on images with observed selection frequency $k/n$. Right: Adjusted v1 versus v2 accuracy plots, calculated for varying numbers of annotators per image (with adjusted accuracy computed using the naïve estimator of Section \ref{sec:naive}). Each point in the plot corresponds to a trained model.
  • ...and 9 more figures
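
The effect described in the captions of Figures 2 and 3 can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes the simple model of Figure 3, with true selection frequencies distributed as $\text{Beta}(\alpha+1,\beta)$ for v1 and $\text{Beta}(\alpha,\beta)$ for Flickr, and it further assumes that the observed reading is $\hat{s}(x) = k/n$ with $k$ drawn from a binomial with $n$ annotators and success probability $s(x)$. Under those assumptions, Beta-binomial conjugacy gives the expected true selection frequency for a given reading in closed form. The function names `posterior_mean` and `matching_gap` are hypothetical and chosen only for this example.

```python
def posterior_mean(a, b, k, n):
    """Expected true selection frequency s given k of n annotators selected
    the image, when s has a Beta(a, b) prior (Beta-binomial conjugacy)."""
    return (a + k) / (a + b + n)


def matching_gap(alpha, beta, k, n):
    """Difference in expected true selection frequency between a v1 image and
    a Flickr image that share the same observed reading k/n.

    Assumed simple model: v1 ~ Beta(alpha + 1, beta), Flickr ~ Beta(alpha, beta).
    """
    v1 = posterior_mean(alpha + 1, beta, k, n)
    flickr = posterior_mean(alpha, beta, k, n)
    return v1 - flickr


# Same observed reading (0.7) measured with more and more annotators per image.
for n in (10, 100, 1000):
    k = (7 * n) // 10
    print(f"n={n:5d}  observed={k / n:.2f}  gap={matching_gap(2, 2, k, n):.5f}")
```

In this model the gap simplifies to $(\beta + n - k) / ((\alpha+\beta+n)(\alpha+\beta+n+1))$, which is positive for every finite $n$ and shrinks toward zero as $n$ grows. This matches the qualitative statement in Figure 3(b): matching on noisy readings yields images whose true selection frequency is lower than that of the v1 images they were matched to, and more annotators per image reduce but do not eliminate the discrepancy.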