Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
Sungduk Yu, Man Luo, Avinash Madusu, Vasudev Lal, Phillip Howard
TL;DR
This work introduces the largest dataset to date pairing human-written and AI-written peer reviews of the same papers submitted to ICLR and NeurIPS, with the AI reviews generated by five leading LLMs, and uses it to benchmark 18 AI-text detectors. The results show that detecting AI-generated peer reviews at the level of an individual review remains difficult under stringent false-positive constraints. A context-aware approach called Anchor, which leverages manuscript content, improves detection, especially for challenging models such as GPT-4o. The study also finds that AI-generated reviews are generally less grounded and more favorable than human reviews, which raises fairness concerns for score-driven decisions, and it examines how robust detectors are to prompt variations and AI-assisted editing. Overall, the dataset and methods provide a resource for advancing responsible detection of AI-generated content in scientific peer review and motivate further research into preserving the integrity of review workflows.
Abstract
Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts that are submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming task of reviewing a paper. However, existing resources for benchmarking the detectability of AI text in the domain of peer review are lacking. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish between peer reviews written entirely by humans and those written by different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.
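
To make the idea of evaluating a detector under a stringent false-positive constraint concrete, the following is a minimal sketch of how a true-positive rate at a fixed false-positive rate could be computed from detector scores. It is not the paper's evaluation code. The function name tpr_at_fpr, the convention that higher scores mean "more likely AI-written", and the toy scores are all assumptions made for illustration.

```python
def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """Return (threshold, tpr) at a false-positive budget of max_fpr.

    Assumption: a higher score means the detector considers the review
    more likely to be AI-written. A review is flagged only when its
    score is strictly greater than the threshold.
    """
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")

    # Sort human scores from highest to lowest.
    human_sorted = sorted(human_scores, reverse=True)

    # Number of human reviews we may flag by mistake within the budget.
    allowed = int(max_fpr * len(human_sorted))

    # Using the (allowed + 1)-th highest human score as the threshold
    # guarantees that at most `allowed` human reviews exceed it.
    threshold = human_sorted[allowed]

    # True-positive rate: fraction of AI reviews scoring above the threshold.
    flagged = sum(1 for s in ai_scores if s > threshold)
    return threshold, flagged / len(ai_scores)


# Toy, made-up scores purely for illustration (not results from the paper).
human = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
ai = [0.5, 0.85, 0.92, 0.97, 0.99]
threshold, tpr = tpr_at_fpr(human, ai, max_fpr=0.1)
```

The point of fixing the false-positive budget first is that wrongly accusing a human reviewer is costly, so a detector is only useful if it still catches a meaningful share of AI-written reviews at that threshold.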
