AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews

Keith Tyser, Ben Segev, Gaston Longhitano, Xin-Yu Zhang, Zachary Meeks, Jason Lee, Uday Garg, Nicholas Belsten, Avi Shporer, Madeleine Udell, Dov Te'eni, Iddo Drori

TL;DR

AI-Driven Review Systems explores scalable, bias-aware paper reviews by deploying three integrated AI-based platforms (OpenReviewer, Papers with Reviews, and Reviewer Arena) and a multi-method evaluation framework. It combines human and LLM assessments, automatic prediction of human preferences, and systematic discovery of LLM limitations to map where AI reviews align with or diverge from human judgments. The approach leverages role-playing editorial workflows, adaptive and fixed review questions, and multi-document context to deliver fast, high-quality reviews while mitigating misuse and bias. Publicly available reviews of arXiv and Nature papers demonstrate potential for improved efficiency, transparency, and trend analysis in scholarly reviewing, with careful attention to ethical and methodological safeguards.

Abstract

Automatic reviewing helps handle a large volume of papers, provides early feedback and quality control, reduces bias, and allows the analysis of trends. We evaluate the alignment of automatic paper reviews with human reviews using an arena that collects human preferences through pairwise comparisons. Gathering human preferences can be time-consuming; therefore, we also use an LLM to automatically evaluate reviews, increasing sample efficiency while reducing bias. In addition to evaluating human and LLM preferences among LLM reviews, we fine-tune an LLM to predict which reviews humans will prefer in a head-to-head battle between LLMs. We artificially introduce errors into papers and analyze the LLM's responses to identify limitations. We also use adaptive review questions, meta prompting, and role-playing, integrate visual and textual analysis, use venue-specific reviewing materials, and predict human preferences, improving upon the limitations of traditional review processes. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service that helps authors review and revise their research papers and improve their quality. This work develops proof-of-concept LLM reviewing systems that quickly deliver consistent, high-quality reviews and evaluate their quality. We mitigate the risks of misuse, inflated review scores, overconfident ratings, and skewed score distributions by augmenting the LLM with multiple documents, including the review form, reviewer guide, code of ethics and conduct, area chair guidelines, and previous-year statistics, by finding which errors and shortcomings of the paper may be detected by automated reviews, and by evaluating pairwise reviewer preferences. This work identifies and addresses the limitations of using LLMs as reviewers and evaluators and enhances the quality of the reviewing process.

Paper Structure

This paper contains 37 sections, 6 equations, 37 figures, 13 tables.

Figures (37)

  • Figure 1: OpenReviewer: A user uploads their paper, which is automatically reviewed, and receives the review along with instructions for revision. The user may provide feedback and upload a revised version.
  • Figure 2: Papers with Reviews: Our system collects papers from arXiv and open-access Nature journals, reviews, ranks, and displays their title, authors, abstract, review, and review score, linking back to the papers on arXiv and Nature. Users provide feedback on the reviews, which is then used to improve the automated review process.
  • Figure 3: Reviewer Arena: The paper is reviewed by human reviewers, three closed LLMs, and an open LLM. The reviews are anonymous, and human expert evaluators receive pairs of reviews and indicate which review of each pair they prefer. The process is repeated using GPT-4 as the expert evaluator. The preferences are used to compute win rate matrices, reviewer scores, and rankings.
  • Figure 4: Win rates between five reviewers (three closed LLMs, an open LLM, and a human reviewer) based on human preferences.
  • Figure 5: Win rates between five reviewers (three closed LLMs, an open LLM, and a human reviewer) based on GPT-4 Turbo preferences.
  • ...and 32 more figures
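
The Reviewer Arena caption (Figure 3) describes turning pairwise preferences into win rate matrices, reviewer scores, and rankings. The sketch below illustrates one simple way this could be done; it is not the authors' implementation. The reviewer names, the battle records, the half-credit treatment of ties, and the choice of mean win rate as the reviewer score are all assumptions made for illustration. This section does not state which scoring rule the paper uses, and alternatives such as Elo or Bradley-Terry ratings are also common for arena-style comparisons.

```python
def win_rate_matrix(reviewers, battles):
    """Return rates[i][j]: fraction of pairwise battles reviewer i won against j.

    `battles` is an iterable of (reviewer_a, reviewer_b, winner) tuples, where
    `winner` is one of the two names, or None for a tie (hypothetical format).
    A tie counts as half a win for each side (an assumption of this sketch).
    Pairs that never met, and the diagonal, are reported as None.
    """
    index = {name: i for i, name in enumerate(reviewers)}
    n = len(reviewers)
    wins = [[0.0] * n for _ in range(n)]
    games = [[0] * n for _ in range(n)]

    for a, b, winner in battles:
        i, j = index[a], index[b]
        games[i][j] += 1
        games[j][i] += 1
        if winner is None:
            wins[i][j] += 0.5
            wins[j][i] += 0.5
        elif winner == a:
            wins[i][j] += 1.0
        elif winner == b:
            wins[j][i] += 1.0
        else:
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")

    return [
        [wins[i][j] / games[i][j] if games[i][j] else None for j in range(n)]
        for i in range(n)
    ]


def rank_by_mean_win_rate(reviewers, rates):
    """Score each reviewer by its mean win rate over the opponents it faced."""
    scores = {}
    for i, name in enumerate(reviewers):
        observed = [r for j, r in enumerate(rates[i]) if j != i and r is not None]
        scores[name] = sum(observed) / len(observed) if observed else 0.0
    ranking = sorted(reviewers, key=lambda name: scores[name], reverse=True)
    return scores, ranking


# Hypothetical reviewers and preference records, for illustration only.
reviewers = ["llm_a", "llm_b", "human"]
battles = [
    ("llm_a", "llm_b", "llm_a"),
    ("llm_a", "llm_b", "llm_b"),
    ("llm_a", "human", "llm_a"),
    ("llm_b", "human", None),  # tie
]

rates = win_rate_matrix(reviewers, battles)
scores, ranking = rank_by_mean_win_rate(reviewers, rates)
print(ranking)
```

The same functions apply whether the preference records come from human experts or from an LLM evaluator, so the two resulting matrices can be compared cell by cell. Mean win rate ignores how many battles back each cell, so sparse comparisons would call for a rating model or confidence intervals.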