Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

Rounak Saha, Gurusha Juneja, Dayita Chaudhuri, Naveeja Sajeevan, Nihar B Shah, Danish Pruthi

Abstract

A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews through the use of AI-text detectors should be interpreted with caution, as current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.

Paper Structure (31 sections, 4 equations, 5 figures, 10 tables)

Figures (5)

  • Figure 1: Levels of AI-assistance.
  • Figure 2: Confusion matrices showing the percentage of reviews classified as "AI", "Mixed", or "Human" by Pangram and GPTZero on the hard subset.
  • Figure 3: LLMs inadvertently introduce new content rather than merely "polishing": including the paper manuscript, omitting an explicit content-preservation instruction, and specifying generous word limits in the prompt lead LLMs to introduce new content in the "polished" review. Interestingly, such reviews are more likely to be flagged as AI-generated by Pangram.
  • Figure 4: Distribution of similarity (maximum over all AI-generated references) across levels of AI assistance. Many individual AI-polished reviews receive similarity scores indistinguishable from those of AI$^*$ reviews.
  • Figure 5: Confusion matrices showing the percentage of reviews classified as "AI", "Mixed", or "Human" by Pangram and GPTZero on the easy subset.