Has the Machine Learning Review Process Become More Arbitrary as the Field Has Grown? The NeurIPS 2021 Consistency Experiment
Alina Beygelzimer, Yann N. Dauphin, Percy Liang, Jennifer Wortman Vaughan
TL;DR
This study replicates the NeurIPS 2014 consistency experiment at a larger scale to quantify randomness in the review process amid rapid growth in submissions. In the experiment, 10% of submissions were duplicated and reviewed by two independent committees. The arbitrariness persists: the committees reached inconsistent accept/reject decisions on about 23% of the duplicated papers, and roughly half of the list of accepted papers would change if the review process were rerun. The results, consistent with prior work, suggest that increasing selectivity may heighten arbitrariness and highlight the inherent difficulty of objectively ranking research, prompting discussion of possible reforms and community practices. The analysis also reports on ethics-flag disagreements and AC/SAC feedback, emphasizing the subjective elements in decision-making and the need to balance rigor against the burden on the community. Overall, the paper argues for cautious interpretation of peer-review quality and encourages ongoing debate on improvements to the process.
Abstract
We present the NeurIPS 2021 consistency experiment, a larger-scale variant of the 2014 NeurIPS experiment in which 10% of conference submissions were reviewed by two independent committees to quantify the randomness in the review process. We observe that the two committees disagree on their accept/reject recommendations for 23% of the papers and that, consistent with the results from 2014, approximately half of the list of accepted papers would change if the review process were randomly rerun. Our analysis suggests that making the conference more selective would increase the arbitrariness of the process. Taken together with previous research, our results highlight the inherent difficulty of objectively measuring the quality of research, and suggest that authors should not be excessively discouraged by rejected work.
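The link between the 23% disagreement rate and the "approximately half" figure can be illustrated with a back-of-the-envelope calculation. The sketch below is not the authors' analysis: it assumes disagreements split evenly between the two committees, and the 25% acceptance rate is an illustrative placeholder that does not come from the text above. The function name is likewise hypothetical.

```python
def fraction_of_accepted_that_flip(disagreement_rate, acceptance_rate):
    """Share of one committee's accepted papers that the other committee rejects.

    Assumption (not from the paper): disagreements are symmetric, so half of
    them are "accepted by committee 1, rejected by committee 2".
    """
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    if not 0 <= disagreement_rate <= 1:
        raise ValueError("disagreement_rate must be in [0, 1]")

    # Fraction of ALL papers accepted by one committee and rejected by the other.
    accepted_then_rejected = disagreement_rate / 2
    if accepted_then_rejected > acceptance_rate:
        raise ValueError("disagreement_rate is too large for this acceptance_rate")

    # Condition on the paper having been accepted by the first committee.
    return accepted_then_rejected / acceptance_rate


# 0.23 is the disagreement rate reported in the abstract;
# 0.25 is a hypothetical acceptance rate chosen only for illustration.
share = fraction_of_accepted_that_flip(0.23, 0.25)
print(f"{share:.2f}")
```

Under these assumptions, a little under half of the papers accepted by one committee would be rejected by the other, which is in line with the abstract's "approximately half".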
