What Do Machine Learning Researchers Mean by "Reproducible"?
Edward Raff, Michel Benaroch, Sagar Samtani, Andrew L. Farris
TL;DR
The paper addresses the ambiguity in what AI/ML researchers mean by "reproducibility". It analyzes how the community treats the term across 101 papers published since 2017 and extends the ACM concepts into eight rigor themes: repeatability, reproducibility, replicability, adaptability, model selection, label/data quality, meta/incentives, and maintainability. The resulting taxonomy draws explicit distinctions (e.g., surface vs. in-depth reproducibility; empirical vs. theoretical replicability) and maps the interdependencies among rigor types. The authors highlight gaps, such as how little adaptability has been studied and the need for domain-specific benchmarks and incentive structures, and argue that major conferences should give rigor track-level emphasis. The reframing aims to enable clearer measurement, cross-field comparisons, and more robust, long-lived AI/ML results.
Abstract
The concern that Artificial Intelligence (AI) and Machine Learning (ML) are entering a "reproducibility crisis" has spurred significant research in the past few years. Yet with each paper, it is often unclear what someone means by "reproducibility". Our work attempts to clarify the scope of "reproducibility" as displayed by the community at large. In doing so, we propose to refine the research to eight general topic areas. In this light, we see that each of these areas contains many works that do not advertise themselves as being about "reproducibility", in part because they go back decades before the matter came to broader attention.
