Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks

Tzu-Ling Lin, Wei-Chih Chen, Teng-Fang Hsiao, Hou-I Liu, Ya-Hsin Yeh, Yu Kai Chan, Wen-Sheng Lien, Po-Yen Kuo, Philip S. Yu, Hong-Han Shuai

TL;DR

This paper investigates the robustness of large language models (LLMs) used as automated peer reviewers against textual adversarial attacks. It formulates two core tasks—Review Generation and Score Prediction—and introduces Attack Focus Localization to identify vulnerable regions in long manuscripts. Empirical results show substantial vulnerability to simple perturbations, with high Attack Success Rates and inflated scores, underscoring serious reliability concerns. The authors discuss defenses, policy implications, and the need for human-in-the-loop oversight to safeguard scholarly integrity in AI-assisted peer review. The work highlights critical risks and proposes concrete directions for improving resilience in automated reviewing systems.

Abstract

Peer review is essential for maintaining academic quality, but the increasing volume of submissions places a significant burden on reviewers. Large language models (LLMs) offer potential assistance in this process, yet their susceptibility to textual adversarial attacks raises reliability concerns. This paper investigates the robustness of LLMs used as automated reviewers in the presence of such attacks. We focus on three key questions: (1) The effectiveness of LLMs in generating reviews compared to human reviewers. (2) The impact of adversarial attacks on the reliability of LLM-generated reviews. (3) Challenges and potential mitigation strategies for LLM-based review. Our evaluation reveals significant vulnerabilities, as text manipulations can distort LLM assessments. We offer a comprehensive evaluation of LLM performance in automated peer reviewing and analyze its robustness against adversarial attacks. Our findings emphasize the importance of addressing adversarial risks to ensure AI strengthens, rather than compromises, the integrity of scholarly communication.

Paper Structure

This paper contains 39 sections, 8 equations, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: Illustration of a scenario where an LLM is under adversarial attack: the author may introduce specific patterns (e.g., typos) into the paper, misleading LLM reviewers and distorting the feedback they generate.
  • Figure 2: An example of an aspect-tagged review generated by GPT-4o-mini.
  • Figure 3: Visualization of the average score for each aspect, as predicted by GPT-4o-mini. "Clean" denotes the prediction when the input is not manipulated by StyleAdv.
  • Figure 4: Prompt template used for review generation.