"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun

TL;DR

This work examines the rising threat of In-Paper Prompt Injection (IPI) in AI-assisted peer review. It defines two attack paradigms, static and iterative, and evaluates their effectiveness against three frontier AI reviewers using 100 ICLR 2025 submissions. The results show substantial score inflation and high transferability across models. A defense based on prompt detection is only partially successful, because adaptive attackers can bypass it. The findings highlight fundamental vulnerabilities in AI-assisted reviewing pipelines and call for robust safeguards to ensure the integrity of automated peer review in academic contexts.

Abstract

With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.

Paper Structure

This paper contains 39 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: In the static attack, a fixed malicious prompt is embedded in the manuscript and submitted to an AI reviewer; this single, unchanging injection systematically biases the reviewer and produces elevated scores. In the iterative attack, an adversary uses a surrogate reviewer to optimize the injection prompt through repeated query–feedback cycles, yielding an enhanced prompt that more reliably induces higher scores from AI reviewers.
  • Figure 2: Distributions of (a) peer reviewer scores and (b) paper lengths in the 100 sampled ICLR 2025 submissions.
  • Figure 3: Effect of Human Ratings. The data are grouped using equal-frequency binning, and each point in the line chart represents the average score within a bin containing the same number of papers. Static attacks are conducted with Attack Prompt 3, while iterative attacks are initialized from the same prompt.
  • Figure 4: Impact of Paper Length. The data are grouped using equal-frequency binning, and each point in the line chart represents the average score within a bin containing the same number of papers. Static attacks are conducted with Attack Prompt 3, while iterative attacks are initialized from the same prompt.
  • Figure 5: Comparison of human and AI reviewer ratings across models. Each subplot corresponds to one reviewer model, showing its assigned ratings against human ratings for the same set of papers.
  • ...and 1 more figures
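
The captions for Figures 3 and 4 mention equal-frequency binning, where papers are sorted by a variable and split into bins that each hold the same number of papers, and each plotted point is a bin average. The following is a minimal sketch of that idea, not the authors' code. The function name `equal_frequency_bins` and the sample values are assumptions made for illustration only.

```python
from statistics import mean


def equal_frequency_bins(xs, ys, n_bins):
    """Sort (x, y) pairs by x, split them into n_bins groups of (near-)equal
    size, and return one (mean_x, mean_y) point per group.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    if not 1 <= n_bins <= len(xs):
        raise ValueError("n_bins must be between 1 and the number of points")

    pairs = sorted(zip(xs, ys))  # order papers by the binning variable x
    n = len(pairs)
    points = []
    for b in range(n_bins):
        # Integer arithmetic spreads any remainder across bins, so bin sizes
        # differ by at most one when n is not divisible by n_bins.
        start = b * n // n_bins
        end = (b + 1) * n // n_bins
        chunk = pairs[start:end]
        points.append((mean(x for x, _ in chunk), mean(y for _, y in chunk)))
    return points


# Made-up example values (NOT data from the paper): x could be a human
# rating or a paper length, y the score assigned by an AI reviewer.
example_x = [3.0, 5.5, 6.0, 4.5, 8.0, 7.0]
example_y = [6.0, 7.5, 8.0, 7.0, 9.5, 9.0]
binned_points = equal_frequency_bins(example_x, example_y, n_bins=3)
```

Unlike equal-width binning, this keeps the number of papers per point constant, so no point on the line chart rests on only one or two papers from a sparse region of the distribution.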