Has the Machine Learning Review Process Become More Arbitrary as the Field Has Grown? The NeurIPS 2021 Consistency Experiment

Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan

TL;DR

This study replicates the NeurIPS 2014 consistency experiment at a larger scale to quantify randomness in the review process amid rapid growth in submissions. In the experiment, 10% of submissions were duplicated and reviewed by two independent committees. The study finds persistent arbitrariness: the committees reached inconsistent accept/reject decisions on about 23% of the duplicated papers, and roughly half of the accepted papers would change if the review process were rerun. The results, consistent with prior work, suggest that making the conference more selective would increase arbitrariness, and they highlight the inherent difficulty of objectively ranking research. The analysis also reports on ethics-flag disagreements and AC/SAC feedback, emphasizing the subjective elements in decision-making and the need to balance rigor against the burden on the community. Overall, the paper argues for cautious interpretation of peer-review outcomes and encourages ongoing debate on improvements to the process.

Abstract

We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approximately half of the list of accepted papers would change if the review process were randomly rerun. Our analysis suggests that making the conference more selective would increase the arbitrariness of the process. Taken together with previous research, our results highlight the inherent difficulty of objectively measuring the quality of research, and suggest that authors should not be excessively discouraged by rejected work.

Paper Structure (14 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: A scatter plot showing average scores of the two committees at the time that initial reviews were released (top) and at the time final scores were released (bottom) for each paper in the experiment. The area of each circle is linear in the number of papers that it represents. The Pearson correlation coefficient is 0.575 for initial scores and 0.586 for final scores.
  • Figure 2: Acceptance rates and amount of disagreement between the two committees using different acceptance thresholds. The gray curve shows the random baseline: for each potential acceptance rate, the fraction of papers for which there would be disagreement between two committees if their recommendations were made at random. Small green (respectively, brown) dots show, for each acceptance rate, the level of disagreement there would have been between the two committees if the papers with the highest average final (respectively, initial) reviewer scores were accepted. Error bars depict Wilson's confidence intervals.
  • Figure 3: The fraction of accepted papers that would be rejected by the other committee, using different acceptance thresholds. The gray line shows the random baseline: for each potential acceptance rate, the fraction of accepted papers that would be rejected by the other committee if recommendations were made at random. Small green (respectively, brown) dots show, for each acceptance rate, the fraction of accepted papers that would be rejected by the other committee if the papers with the highest average final (respectively, initial) reviewer scores were accepted.
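The random baselines and error bars mentioned in the captions for Figures 2 and 3 can be illustrated with a short sketch. It assumes that "recommendations made at random" means each committee accepts every paper independently with the same probability p; under that assumption the expected disagreement is 2p(1 - p), and the expected fraction of one committee's accepted papers that the other rejects is 1 - p. The paper may construct its gray curves differently. The function names are hypothetical, and the Wilson interval uses the standard textbook formula with z = 1.96 for roughly 95% coverage.

```python
import math


def random_disagreement(p):
    """Expected fraction of papers on which two committees disagree,
    ASSUMING each accepts every paper independently with probability p."""
    return 2 * p * (1 - p)


def random_accepted_then_rejected(p):
    """Expected fraction of one committee's accepted papers that the other
    committee rejects, under the same independence assumption."""
    return 1 - p


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of papers with the property (e.g. disagreement)
    n:         number of papers examined
    z:         normal quantile (1.96 gives roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Illustrative acceptance rates only; these are not values from the paper.
    for p in (0.10, 0.25, 0.50):
        print(p, random_disagreement(p), random_accepted_then_rejected(p))

    # Hypothetical counts, chosen only to show the call.
    low, high = wilson_interval(successes=50, n=100)
    print(f"Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under this independence assumption the baseline disagreement peaks at an acceptance rate of 50%, while the baseline fraction of accepted papers rejected by the other committee grows steadily as the acceptance rate falls. That makes the gray curves a useful reference when reading the green and brown dots at different thresholds.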