Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)

Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, Hugo Larochelle

TL;DR

This paper analyzes NeurIPS 2019’s reproducibility program, which integrated a code submission policy, a reproducibility challenge, and a machine-learning reproducibility checklist to raise standards in ML research. It documents implementation details, community reception, and early impacts such as higher code availability and active reproducibility efforts, while noting that evidence for lasting improvements in paper quality remains inconclusive. The study highlights encouraging signs of increased transparency and collaboration, and discusses challenges and future directions for broader adoption across venues. Overall, it presents a case study on institutionalizing reproducibility practices to strengthen reliability and accelerate scientific progress in ML.

Abstract

One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submission policy, a community-wide reproducibility challenge, and the inclusion of the Machine Learning Reproducibility checklist as part of the paper submission process. In this paper, we describe each of these components, how it was deployed, as well as what we were able to learn from this initiative.

Paper Structure

This paper contains 12 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Reproducible Research. Adapted from: https://github.com/WhitakerLab/ReproducibleResearch
  • Figure 2: Effect of code submission policy. (a) Link to code provided at initial submission and at camera-ready, as a function of the affiliation of the first and last authors. We observe that, for industry-affiliated authors, code is not provided in the initial submission but is provided later, at camera-ready. Overall, authors from academia are more likely to release the code for their papers. (b) Acceptance rate of submissions as a function of the affiliation of the first and last authors. The red dashed line shows the acceptance rate for all submissions. Industry-affiliated authors have a higher chance of acceptance.
  • Figure 3: (a) Diagram of the transition in code availability from initial submission to camera-ready, restricted to submissions with an industry-affiliated first or last author. (b) Percentage of submissions that reported providing code on the checklist and whose code was subsequently confirmed by the reviewers.
  • Figure 4: Author responses to all checklist questions for NeurIPS 2019 submitted papers.
  • Figure 5: Acceptance rate per question. The x-axis shows the question number on the checklist. The numbers within each bar give the number of submissions for each answer. See the checklist figure (and the ML checklist figure in the Appendix) for the text corresponding to each question number. The red dashed line shows the acceptance rate for all submissions.
  • ...and 4 more figures