Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Rui Ye, Xianghe Pang, Jingyi Chai, Jiaao Chen, Zhenfei Yin, Zhen Xiang, Xiaowen Dong, Jing Shao, Siheng Chen

TL;DR

This paper investigates the risks of using large language models to perform scholarly peer review. It systematically analyzes explicit and implicit manipulation, along with inherent flaws such as hallucination and bias, using ICLR-2024 data and multiple LLMs. The findings show that authors can steer LLM reviews through covert content or by framing limitations, and that LLMs exhibit biased or inflated judgments under certain conditions. The work argues for robust safeguards, detection mechanisms, and a cautious, supplementary role for LLMs rather than full replacement of human reviewers.

Abstract

Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that deliberately injecting covert content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers than to full papers and favoring well-known authors in a single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

Paper Structure

This paper contains 19 sections, 1 equation, 24 figures, 6 tables.

Figures (24)

  • Figure 1: (a) The academic community has begun exploring the feasibility of using LLMs for peer review, with many already adopting this practice. This paper uncovers a series of potential risks of this practice. (b) By embedding small, white, manipulative text in the manuscript, authors can directly influence LLM reviewers to generate positive reviews. (c) Compared to human reviewers, LLM reviewers are significantly more likely to reiterate limitations explicitly disclosed by the authors (measured by the overlap of key points between two sequences). (d) LLM reviewers may assign disproportionately high scores even when provided with incomplete content (e.g., content with only the title).
  • Figure 2: Rating comparisons between review systems before and after manipulation. The average rating increases significantly after manipulation, shifting from a borderline rating to a substantially positive rating. This indicates that LLMs can be explicitly manipulated to give reviews that clearly lean towards acceptance.
  • Figure 3: Systematic impact on the Top-30% papers. Each point (x,y) indicates that when x% of human reviews are replaced by LLM reviews, y% of top-30% papers are accordingly replaced with originally lower-ranking papers. The influence on ranking shifts becomes more pronounced as the replacement ratio increases.
  • Figure 4: Ranking changes when 5% of reviews are randomly replaced with LLM reviews (left: reviews without manipulation; right: reviews with manipulation). Manipulated reviews cause more significant shifts in rankings than unmanipulated ones. Notably, papers from all original sections show the potential to move into the highest ranking section.
  • Figure 5: A case of implicit manipulation (more in Figures \ref{fig:implicit_case_app1}, \ref{fig:implicit_case_app2}, and \ref{fig:implicit_case_app3}). LLMs tend to reiterate the limitations disclosed by authors in the paper. Text spans with the same background color share similar meaning.
  • ...and 19 more figures
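
The replacement simulation described in the abstract and in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function name `top_k_displacement`, the ranking by mean rating, the uniform random choice of review slots, and the toy scores are all hypothetical and do not come from the paper.

```python
import random


def top_k_displacement(human_scores, llm_scores, replace_frac,
                       top_frac=0.3, seed=0):
    """Fraction of originally top-ranked papers that fall out of the top set
    after a share of human ratings is swapped for LLM ratings.

    human_scores / llm_scores: dict mapping paper id -> list of ratings,
    with one LLM rating available for every human review slot (assumption).
    """
    rng = random.Random(seed)

    # Enumerate every (paper, review slot) pair and pick the slots to replace.
    slots = [(p, i) for p, rs in human_scores.items() for i in range(len(rs))]
    n_replace = round(replace_frac * len(slots))
    chosen = set(rng.sample(slots, n_replace))

    # Build the mixed rating table: LLM rating where chosen, human otherwise.
    mixed = {
        p: [llm_scores[p][i] if (p, i) in chosen else r
            for i, r in enumerate(rs)]
        for p, rs in human_scores.items()
    }

    def top_set(scores):
        # Rank by mean rating (assumed criterion); ties broken by paper id.
        k = max(1, round(top_frac * len(scores)))
        ranked = sorted(
            scores,
            key=lambda p: (-sum(scores[p]) / len(scores[p]), p),
        )
        return set(ranked[:k])

    before = top_set(human_scores)
    after = top_set(mixed)
    return len(before - after) / len(before)


# Toy data (hypothetical): paper i receives three human ratings equal to i,
# while the LLM inflates the three weakest papers to a rating of 10.
human = {i: [i, i, i] for i in range(10)}
llm = {i: ([10, 10, 10] if i < 3 else [i, i, i]) for i in range(10)}

no_change = top_k_displacement(human, llm, replace_frac=0.0)
full_swap = top_k_displacement(human, llm, replace_frac=1.0)
```

Sweeping `replace_frac` over a grid and plotting the returned fraction would give a curve of the kind the Figure 3 caption describes, with the replacement ratio on the x-axis and the share of displaced top-30% papers on the y-axis.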