The AI Review Lottery: Widespread AI-Assisted Peer Reviews Boost Paper Scores and Acceptance Rates

Giuseppe Russo Latona, Manoel Horta Ribeiro, Tim R. Davidson, Veniamin Veselovsky, Robert West

TL;DR

We show that AI-assisted reviews are consequential to the peer-review process and discuss future implications of current trends.

Abstract

Journals and conferences worry that peer reviews assisted by artificial intelligence (AI), in particular, large language models (LLMs), may negatively influence the validity and fairness of the peer-review system, a cornerstone of modern science. In this work, we address this concern with a quasi-experimental study of the prevalence and impact of AI-assisted peer reviews in the context of the 2024 International Conference on Learning Representations (ICLR), a large and prestigious machine-learning conference. Our contributions are threefold. Firstly, we obtain a lower bound for the prevalence of AI-assisted reviews at ICLR 2024 using the GPTZero LLM detector, estimating that at least $15.8\%$ of reviews were written with AI assistance. Secondly, we estimate the impact of AI-assisted reviews on submission scores. Considering pairs of reviews with different scores assigned to the same paper, we find that in $53.4\%$ of pairs the AI-assisted review scores higher than the human review ($p = 0.002$; relative difference in probability of scoring higher: $+14.4\%$ in favor of AI-assisted reviews). Thirdly, we assess the impact of receiving an AI-assisted peer review on submission acceptance. In a matched study, submissions near the acceptance threshold that received an AI-assisted peer review were $4.9$ percentage points ($p = 0.024$) more likely to be accepted than submissions that did not. Overall, we show that AI-assisted reviews are consequential to the peer-review process and offer a discussion on future implications of current trends.
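To make the second contribution concrete, the following is a minimal sketch of the same-paper pair comparison described in the abstract: every AI-assisted review is paired with every human review of the same paper, tied pairs are dropped, and the share of pairs in which the AI-assisted review scores higher is reported. The function name `ai_win_share`, the input layout, and the toy scores are hypothetical and only illustrate the idea; the authors' actual pipeline may differ.

```python
from itertools import product


def ai_win_share(papers):
    """Share of (AI-assisted, human) review pairs where the AI-assisted score is higher.

    `papers` is a list of dicts with two keys (a hypothetical layout):
      "ai":    scores of reviews flagged as AI-assisted
      "human": scores of reviews flagged as human
    Pairs are only formed within a paper, and tied pairs are excluded.
    """
    wins = 0
    losses = 0
    for paper in papers:
        for ai_score, human_score in product(paper["ai"], paper["human"]):
            if ai_score == human_score:
                continue  # the abstract considers only pairs with different scores
            if ai_score > human_score:
                wins += 1
            else:
                losses += 1
    total = wins + losses
    return wins / total if total else float("nan")


# Toy scores, invented for illustration (not ICLR data):
# paper 1 yields one win, one tie (dropped), one loss; paper 2 yields one win.
toy_papers = [
    {"ai": [6], "human": [5, 6, 8]},
    {"ai": [8], "human": [6]},
]
print(ai_win_share(toy_papers))
```

Pairing within a paper is what controls for the properties of the reviewed submission: both reviews in a pair judged the same manuscript, so a systematic gap cannot be attributed to paper quality.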

Paper Structure (7 sections, 4 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: Overview of our quasi-experimental approach to estimate the prevalence and causal effects of AI-assisted reviews. Study 1: Estimating the prevalence of AI-assisted reviews by classifying each review as human or AI-assisted using an out-of-the-box LLM-detection model. Study 2: Estimating the effect of AI-assisted reviews on paper scores by comparing the scores of human and AI-assisted reviews assigned to the same paper (thus controlling for properties of the reviewed paper). Study 3: Estimating the effect of AI-assisted reviews on acceptance rate: we match papers into pairs $\langle i,j\rangle$ such that (1) $i$ and $j$ are similar in content, (2) $i$ and $j$ received the same number $m$ of reviews, (3) $i$ received exactly one AI-assisted review, and $j$ none, (4) $i$'s $m-1$ human scores are identical to $m-1$ of $j$'s $m$ human scores. We then estimate the causal effect of AI-assisted reviews on paper acceptance as the difference in acceptance rates between $i$ and $j$ in matched pairs.
  • Figure 2: Estimated prevalence of AI-assisted ICLR reviews 2018--2024 (Study 1). Using the LLM detector's predictions in pre-ChatGPT years (2018--2023) to calculate its false-positive rate, we estimate that 15.8% of reviews in 2024 were AI-assisted (prevalence minus projection in the plot). We estimated 95% confidence intervals using bootstrap resampling for the prevalence (gray line), but they are too small to be visible. For the projection (orange line; the average prevalence between 2018 and 2022), we plot an error bar corresponding to the prevalence ranges observed in previous years.
  • Figure 3: Mean submission-level differences between AI-assisted and human reviews as a function of human reference scores (Study 2). We consider submissions with at least three reviews, where at least one is AI-assisted and at least two are human. Then, we select a human review as the reference review (with score $r_\text{ref}$) and estimate the average difference between AI-assisted and human reviews ($r_\text{AI} - r_\text{h}$). In the plot, we show the average difference ($y$-axis) for each possible score of the reference review ($x$-axis). AI-assisted reviews consistently give higher scores than human reviews.
  • Figure 4: Effect of receiving an AI-assisted review on submission acceptance (Study 3). (A) We stratify the effect of AI-assisted reviews on submission acceptance by matched submissions' average score across the human reviews they received ($y$-axis). We find a particularly pronounced effect for "borderline" submissions (average score between 5 and 6), with an increased acceptance rate of 4.9 percentage points ($p=0.024$). Overall, we find that submissions that received an AI-assisted review are 3.1 percentage points more likely to be accepted ($p=0.024$). (B) Acceptance rate and (C) prevalence of submissions receiving only human reviews, across human-score bins. E.g., 20.7% of submissions were in the $[5, 6)$ bin, and submissions receiving only human reviews in this bin were accepted 73.6% of the time.
  • Figure 5: Robustness to the AI-assisted review labeling threshold. The plots vary the 0.5 threshold used to label reviews as AI-assisted or human and show the resulting prevalence analysis (A), review score difference analysis (B), and acceptance analysis (C).
  • ...and 4 more figures
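
The matching conditions listed for Study 3 in Figure 1 can be sketched in code. The example below is an illustrative sketch only: it checks conditions (2) to (4) for a candidate pair and computes the difference in acceptance rates over matched pairs. The `Submission` class, its field names, and both function names are hypothetical. Condition (1), content similarity, is assumed to be handled elsewhere, since the captions do not specify how it is measured.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Submission:
    # Hypothetical record layout; not taken from the paper's code.
    human_scores: list = field(default_factory=list)
    ai_scores: list = field(default_factory=list)
    accepted: bool = False


def scores_compatible(i, j):
    """Check conditions (2)-(4) for a treated submission i and a control j."""
    # (3) i received exactly one AI-assisted review, j received none.
    if len(i.ai_scores) != 1 or j.ai_scores:
        return False
    # (2) same number m of reviews: i has m-1 human reviews plus one AI-assisted.
    if len(i.human_scores) + 1 != len(j.human_scores):
        return False
    # (4) i's m-1 human scores must appear among j's m human scores.
    # Counter subtraction keeps only positive counts, so an empty result
    # means every score of i is covered by j (respecting multiplicity).
    leftover = Counter(i.human_scores) - Counter(j.human_scores)
    return not leftover


def acceptance_difference(pairs):
    """Acceptance rate of treated submissions minus that of matched controls."""
    if not pairs:
        raise ValueError("need at least one matched pair")
    treated_rate = sum(i.accepted for i, _ in pairs) / len(pairs)
    control_rate = sum(j.accepted for _, j in pairs) / len(pairs)
    return treated_rate - control_rate


# Toy submissions, invented for illustration (not ICLR data).
treated = Submission(human_scores=[5, 6], ai_scores=[8], accepted=True)
control = Submission(human_scores=[5, 6, 3], ai_scores=[], accepted=False)
print(scores_compatible(treated, control))
print(acceptance_difference([(treated, control)]))
```

Under this design the two submissions in a pair differ, in terms of scores, only in one review slot: the treated submission has an AI-assisted review where the control has a human one, which is what lets the acceptance gap be read as an effect of the AI-assisted review.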