Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, James Zou

TL;DR

The paper tackles declining peer-review quality amid rising conference submissions by deploying a Review Feedback Agent, a multi-LLM pipeline that gives reviewers targeted feedback on their comments. In a large-scale randomized study at ICLR 2025, optional feedback was provided on more than 20,000 reviews, and feedback was posted only if it passed automated reliability checks. Reviewers selected to receive AI feedback updated their reviews more often, wrote longer reviews, and engaged in longer author–reviewer discussions during rebuttals. Incorporation of feedback was substantial, and reviewers were more likely to incorporate feedback when given fewer items. The work demonstrates that carefully designed, reliability-guarded AI feedback can make reviews more specific and actionable at scale, without altering acceptance decisions, and the implementation is publicly available.

Abstract

Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.
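The guardrail idea in the abstract (feedback is sent only if it passes every reliability test) can be illustrated with a minimal sketch. The `Feedback` type, the `gate` function, and the two rule-based checks below are hypothetical stand-ins chosen for illustration; the tests described in the paper are LLM-powered, and this sketch does not reproduce them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Feedback:
    """Hypothetical container for the feedback generated for one review."""
    review_id: str
    items: list[str]


# A reliability test returns True when the feedback passes.
ReliabilityTest = Callable[[Feedback], bool]


def has_items(fb: Feedback) -> bool:
    # Stand-in check: there must be at least one feedback item.
    return len(fb.items) > 0


def no_blank_items(fb: Feedback) -> bool:
    # Stand-in check: no item may be empty or whitespace-only.
    return all(item.strip() for item in fb.items)


# Illustrative test suite; the real suite is LLM-based (assumption: the
# interface of "feedback in, pass/fail out" is enough to show the gating).
TESTS: list[ReliabilityTest] = [has_items, no_blank_items]


def gate(fb: Feedback, tests: list[ReliabilityTest]) -> Optional[Feedback]:
    """Return the feedback only if every test passes; otherwise return None."""
    return fb if all(test(fb) for test in tests) else None


if __name__ == "__main__":
    ok = Feedback("r1", ["Please specify which baseline is missing."])
    bad = Feedback("r2", [])
    print(gate(ok, TESTS) is not None)  # feedback would be posted
    print(gate(bad, TESTS) is None)     # feedback would be withheld
```

The all-or-nothing rule means a single failed test withholds the whole feedback message, which trades coverage for reliability.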

Paper Structure

This paper contains 15 sections, 7 figures, 2 tables, 1 algorithm.

Figures (7)

  • Figure 1: (A) Randomized controlled study setup. Before the start of the review period, we randomly assigned each submission to one of three groups to determine how many of its reviews received feedback: none, half, or all. When a review selected to receive feedback was submitted, the agent generated and posted feedback after 1 hour. Reviewers could optionally update their review based on the feedback until the end of the review period, which ran from October 14 to November 12, 2024. (B) Feedback categories. Our system is designed to address three main types of review comments. Here, we provide examples of comments that would receive feedback from our agent, as well as examples of the generated feedback. (C) Review Feedback Agent. Our system consists of five LLMs (two Actors, an Aggregator, a Critic, and a Formatter). The two parallel Actors generate the initial feedback, which is then passed to the Aggregator, the Critic, and the Formatter in turn. Finally, the feedback is passed through the reliability tests; if it passes, the feedback is posted on the review. We provide examples of comments and feedback given to those comments by our system.
  • Figure 2: OpenReview interface. Here, we provide an example of feedback posted to a review on the OpenReview website (with consent from the reviewer). Feedback is only visible to the reviewer and the ICLR program chairs and was posted roughly one hour after the initial review was submitted.
  • Figure 3: (A) Feedback statistics. Among all ICLR 2025 reviews, 22,467 were randomly selected to receive feedback (feedback group), and 22,364 were randomly selected not to receive feedback (control group). Of those selected to receive feedback, 18,946 (42.3% of all reviews) successfully received feedback, with 26.6% of those reviewers updating their reviews. (B) Update rates. (Top) Most reviews were submitted 2-3 days before the review deadline (November 4, 2024). (Bottom) Reviewers were more likely to update their review if they submitted it early relative to the deadline. Reviewers who received feedback were much more likely to update their reviews than those in the control group, with a difference of approximately 17 percentage points. (C) Average change in review length (measured as number of words). Review length is measured only for the following sections: summary, strengths, weaknesses, and questions. The difference in review length between the control and feedback groups is statistically significant ($^{**}$p $\leq$ 0.01): on average, reviews selected to receive feedback grew by 14 more words than control reviews (a 200% increase). The difference is more pronounced between the not-updated and updated groups ($^{***}$p $\leq$ 0.001).
  • Figure 4: (A) Overall incorporation statistics. Through our LLM-based incorporation analysis, we estimate that 23.6% of reviewers who were given feedback incorporated at least one of the feedback items they received. Since 26.6% of reviewers who received feedback updated their review, this means that 89% of those who updated incorporated at least one item. (B) Feedback incorporation trends. Here, we illustrate the relationship between the number of feedback items received by reviewers who updated their review and the number of those items they incorporated. In total, reviewers incorporated 12,222 feedback items. Notably, reviewers were more likely to incorporate feedback when given fewer items.
  • Figure 5: (A) Feedback clusters. We used an LLM to group all the feedback items we provided to reviewers into five distinct clusters based on their text. We found that nearly half of the feedback asked the reviewer to 'clarify methodological concerns to make their request specific and actionable.' The next most common cluster was feedback asking the reviewer to 'clarify their request by adding specific analyses, baselines, or references.' (B) Incorporation rate by cluster. We measured the percentage of feedback items within each cluster that reviewers incorporated. Overall, 17.7% of all feedback was incorporated. When examined by cluster, incorporation rates ranged from 14% to 18%, with no statistically significant differences observed.
  • ...and 2 more figures
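
The percentages quoted in the Figure 3 to Figure 5 captions can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers that appear in the captions; the variable names are illustrative, and the implied total number of feedback items is an inference that assumes the 17.7% rate is taken over all feedback items provided.

```python
# Counts and rates quoted in the figure captions.
selected_for_feedback = 22_467   # feedback group (Figure 3A)
control = 22_364                 # control group (Figure 3A)
received_feedback = 18_946       # reviews on which feedback was posted
update_rate = 0.266              # recipients who updated their review
incorporate_rate = 0.236         # recipients who incorporated >= 1 item (Figure 4A)
items_incorporated = 12_222      # total incorporated items (Figure 4B)
item_incorporation_rate = 0.177  # share of feedback items incorporated (Figure 5B)

all_reviews = selected_for_feedback + control


def pct(x: float) -> float:
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(100 * x, 1)


# Share of ALL reviews that received feedback.
share_of_all = pct(received_feedback / all_reviews)

# Share of the feedback group that actually received feedback.
share_of_selected = pct(received_feedback / selected_for_feedback)

# Among reviewers who updated, the share who incorporated at least one item.
incorporated_given_updated = pct(incorporate_rate / update_rate)

# Inferred (not stated in the captions): total feedback items provided,
# assuming 17.7% is computed over all items.
implied_total_items = round(items_incorporated / item_incorporation_rate)

if __name__ == "__main__":
    print(all_reviews, share_of_all, share_of_selected)
    print(incorporated_given_updated, implied_total_items)
```

Under these numbers, 18,946 is about 42.3% of all 44,831 reviews and about 84.3% of the 22,467 reviews selected for feedback, and 23.6% / 26.6% is about 88.7%, which matches the 89% figure after rounding.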