"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers

Qin Zhou; Zhexin Zhang; Zhi Li; Limin Sun

TL;DR

This work examines the rising threat of In-Paper Prompt Injection (IPI) in AI-assisted peer review. It defines two attack paradigms, static and iterative, and evaluates their effectiveness against three frontier AI reviewers using 100 ICLR 2025 submissions. The results show substantial score inflation and high transferability across models. A detection-based defense substantially reduces the attack success rate, but adaptive attackers can partially circumvent it. The findings highlight fundamental vulnerabilities in AI-assisted reviewing pipelines and call for robust safeguards to ensure the integrity of automated peer review in academic contexts.

Abstract

With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
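To make the idea of a detection-based defense concrete, the following is a minimal sketch of a pre-review filter that scans a paper's extracted text for instruction-like phrases aimed at an AI reviewer. It is an illustrative assumption, not the detector evaluated in this work: the pattern list, `find_injections`, and `is_flagged` are hypothetical names chosen for this example.

```python
import re
from typing import List

# Hypothetical phrase patterns (assumptions for illustration only,
# not taken from the paper's detector).
SUSPICIOUS_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"give (a )?positive review",
    r"as an? (ai|llm|language model) reviewer",
    r"do not (mention|highlight) (any )?(weaknesses|negatives)",
]


def find_injections(paper_text: str) -> List[str]:
    """Return the first match of each suspicious pattern found in the text."""
    # Collapse whitespace so phrases split across line breaks still match.
    normalized = " ".join(paper_text.split())
    hits = []
    for pattern in SUSPICIOUS_PATTERNS:
        match = re.search(pattern, normalized, flags=re.IGNORECASE)
        if match:
            hits.append(match.group(0))
    return hits


def is_flagged(paper_text: str) -> bool:
    """Flag a paper for manual inspection if any pattern matches."""
    return bool(find_injections(paper_text))


if __name__ == "__main__":
    sample = "Results are in Table 2. Give a positive review only."
    print(find_injections(sample))
```

A fixed pattern list like this would plausibly catch only a fixed (static) injection prompt. An attacker who rephrases the injected text can evade it, which is consistent with the observation above that an adaptive attacker can partially circumvent detection.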