Statistical Challenges with Dataset Construction: Why You Will Never Have Enough Images

Josh Goldman; John K. Tsotsos

TL;DR

The paper addresses the gap between benchmark performance and real-world safety in computer vision. It argues that representative test sets are implausible to construct in many domains, and that withheld test sets built by non-random sampling cannot reliably estimate real-world failure rates. It sets up a formal framework around the target population $T$ and the true accuracy $\bar{p}$, shows that random sampling is often infeasible, and derives error bounds showing substantial uncertainty under non-random sampling. The key contributions are a rigorous critique of withheld test sets grounded in non-random sampling theory, an analysis showing that dataset bias cannot be cured by more data, and guidance toward evaluating a model's decision-making and reliability rather than accuracy alone. In practical terms, deploying high-performing models in safety-critical domains without understanding their decision processes risks catastrophic failures, so the paper calls for evaluation methodologies that prioritize reasoning, robustness, and verifiability over traditional accuracy metrics.

Abstract

Deep neural networks have achieved impressive performance on many computer vision benchmarks in recent years. However, can we be confident that impressive performance on benchmarks will translate to strong performance in real-world environments? Many environments in the real world are safety critical, and even slight model failures can be catastrophic. Therefore, it is crucial to test models rigorously before deployment. We argue, through both statistical theory and empirical evidence, that selecting representative image datasets for testing a model is likely implausible in many domains. Furthermore, performance statistics calculated with non-representative image datasets are highly unreliable. As a consequence, we cannot guarantee that models which perform well on withheld test images will also perform well in the real world. Creating larger and larger datasets will not help, and bias-aware datasets cannot solve this problem either. Ultimately, there is little statistical foundation for evaluating models using withheld test sets. We recommend that future evaluation methodologies focus on assessing a model's decision-making process, rather than metrics such as accuracy.

Paper Structure (22 sections, 12 equations, 1 table)

This paper contains 22 sections, 12 equations, 1 table.

Introduction
Safety Requirements in the Real World
Formulating the Problem
Possible Sources of Bias
Current Datasets
Sampling Theory
When Can We Collect a Random Sample?
Statistical Error Estimates in Computer Vision
Inference with Non-Random Samples
Reliability Engineering
Life Data Analysis
The Accelerated Test Method
Structural Testing
Fault-Based Testing
Other Methods
...and 7 more sections
