Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review

Sungduk Yu, Man Luo, Avinash Madusu, Vasudev Lal, Phillip Howard

TL;DR

This work introduces the largest dataset to date pairing human-written and AI-written peer reviews of the same papers submitted to ICLR and NeurIPS. The AI reviews were generated with five leading LLMs, and the dataset is used to benchmark 18 AI-text detectors. The results show that detecting AI-generated peer reviews at the individual-review level remains difficult under stringent false-positive constraints. A context-aware approach called Anchor, which leverages manuscript content, improves detection, especially for challenging models like GPT-4o. The study also finds that AI-generated reviews are generally less grounded and more favorable than human reviews, raising fairness concerns for score-driven decisions, and it examines the robustness of detectors to prompt variations and AI-assisted editing. The dataset and methods are intended as a resource for research on detecting AI-generated content in scientific peer review and on preserving the integrity of review workflows.

Abstract

Peer review is a critical process for ensuring the integrity of published scientific research. Confidence in this process is predicated on the assumption that experts in the relevant domain give careful consideration to the merits of manuscripts submitted for publication. With the recent rapid advancements in large language models (LLMs), a new risk to the peer review process is that negligent reviewers will rely on LLMs to perform the often time-consuming task of reviewing a paper. However, existing resources for benchmarking the detectability of AI text in the domain of peer review are lacking. To address this deficiency, we introduce a comprehensive dataset containing a total of 788,984 AI-written peer reviews paired with corresponding human reviews, covering 8 years of papers submitted to each of two leading AI research conferences (ICLR and NeurIPS). We use this new resource to evaluate the ability of 18 existing AI text detection algorithms to distinguish peer reviews written entirely by humans from those written by different state-of-the-art LLMs. Additionally, we explore a context-aware detection method called Anchor, which leverages manuscript content to detect AI-generated reviews, and analyze the sensitivity of detection models to LLM-assisted editing of human-written text. Our work reveals the difficulty of identifying AI-generated text at the individual peer review level, highlighting the urgent need for new tools and methods to detect this unethical use of generative AI. Our dataset is publicly available at: https://huggingface.co/datasets/IntelLabs/AI-Peer-Review-Detection-Benchmark.

Paper Structure

This paper contains 63 sections, 1 equation, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Left panel: our data construction pipeline. Right panel: our context-aware detection method (Anchor), specifically designed for AI-generated review detection and evaluated in the Anchor section.
  • Figure 2: ROC plots computed from the combined GPT-4o, Gemini, and Claude review calibration dataset, showing results for ICLR (left) and NeurIPS (right); AUC values are shown in parentheses.
  • Figure S1: Computation time (in seconds) for processing 100 samples. Each method was repeated 20 times to compute the mean and standard deviation. All methods were run on a single NVIDIA RTX A6000 GPU, except for Anchor, which used sequential API calls without GPU acceleration.
  • Figure S2: t-SNE visualization of sentence embeddings from AI-generated reviews in the ICLR2021 test set. Blue points represent reviews generated using the main score-aligned prompt, and orange points represent those from the alternative archetype-based prompt. The substantial overlap between the two distributions suggests that prompt variation does not cause major shifts in model outputs. Embeddings were computed using OpenAI’s text-embedding-3-small model; t-SNE was performed with 2 output dimensions and a perplexity of 30.
  • Figure S3: Difference between AI and human scores. For each matched review (aligned by paper ID and recommendation), score differences were computed and displayed as histograms. Scores range from 1 to 4 for all metrics except Confidence, which ranges from 1 to 5. Statistical significance was assessed using a two-sided Wilcoxon signed-rank test, with p-values shown in the legend. This figure includes only NeurIPS2022-2024 and ICLR2024, because they are the only conferences that required reviewers to submit these scores in their review templates.
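
The ROC curves and AUC values in Figure 2, and the stringent false-positive constraint mentioned in the TL;DR, can be related with a small worked example. The sketch below is not the paper's evaluation code: the function names `auc` and `tpr_at_fpr`, the flagging rule (score strictly above a threshold), and the toy detector scores are all assumptions made for illustration. It computes the AUC as the probability that an AI-written review scores higher than a human-written one, and then finds the true-positive rate achievable when the false-positive rate on human reviews is capped.

```python
from typing import Sequence, Tuple


def auc(human_scores: Sequence[float], ai_scores: Sequence[float]) -> float:
    """AUC as the fraction of (AI, human) pairs ranked correctly; ties count half."""
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


def tpr_at_fpr(human_scores: Sequence[float],
               ai_scores: Sequence[float],
               max_fpr: float) -> Tuple[float, float]:
    """Return (threshold, TPR) for the lowest threshold with FPR <= max_fpr.

    Assumed flagging rule: a review is flagged as AI-written when its
    detector score is strictly greater than the threshold.
    """
    if not human_scores or not ai_scores:
        raise ValueError("both score lists must be non-empty")
    if max_fpr < 0:
        raise ValueError("max_fpr must be non-negative")
    if max_fpr >= 1:
        # No constraint: flag everything.
        return float("-inf"), 1.0
    # Raising the threshold can only lower FPR and TPR, so the first
    # feasible threshold in ascending order gives the highest TPR.
    for thr in sorted(set(human_scores) | set(ai_scores)):
        fpr = sum(s > thr for s in human_scores) / len(human_scores)
        if fpr <= max_fpr:
            tpr = sum(s > thr for s in ai_scores) / len(ai_scores)
            return thr, tpr
    # Unreachable: the largest observed score always yields FPR = 0.
    raise RuntimeError("no feasible threshold found")


if __name__ == "__main__":
    # Hypothetical detector scores (higher = more likely AI-written).
    human = [0.1, 0.2, 0.3, 0.4, 0.9]
    ai = [0.35, 0.6, 0.7, 0.8, 0.95]
    print("AUC:", auc(human, ai))
    print("TPR at FPR <= 0.0:", tpr_at_fpr(human, ai, 0.0))
    print("TPR at FPR <= 0.2:", tpr_at_fpr(human, ai, 0.2))
```

On these toy scores the AUC is 0.8, yet requiring zero false positives leaves only one of the five AI reviews flagged (TPR 0.2), while allowing one false positive in five human reviews raises the TPR to 0.8. This illustrates why a detector with a reasonable AUC can still perform poorly at the individual-review level once the false-positive rate is tightly constrained.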