Policies Permitting LLM Use for Polishing Peer Reviews Are Currently Not Enforceable

Rounak Saha; Gurusha Juneja; Dayita Chaudhuri; Naveeja Sajeevan; Nihar B Shah; Danish Pruthi

Abstract

A number of scientific conferences and journals have recently enacted policies that prohibit LLM usage by peer reviewers, except for polishing, paraphrasing, and grammar correction of otherwise human-written reviews. But are these policies enforceable? To answer this question, we assemble a dataset of peer reviews simulating multiple levels of human-AI collaboration, and evaluate five state-of-the-art detectors, including two commercial systems. Our analysis shows that all detectors misclassify a non-trivial fraction of LLM-polished reviews as AI-generated, thereby risking false accusations of academic misconduct. We further investigate whether peer-review-specific signals, including access to the paper manuscript and the constrained domain of scientific writing, can be leveraged to improve detection. While incorporating such signals yields measurable gains in some settings, we identify limitations in each approach and find that none meets the accuracy standards required for identifying AI use in peer reviews. Importantly, our results suggest that recent public estimates of AI use in peer reviews, obtained with AI-text detectors, should be interpreted with caution: current detectors misclassify mixed reviews (collaborative human-AI outputs) as fully AI-generated, potentially overstating the extent of policy violations.

Paper Structure (31 sections, 4 equations, 5 figures, 10 tables)

Introduction
Data.
Detectors evaluated.
Potential peer-review-specific advantages.
Main findings.
Related Work
Evaluation Framework
Experiments & Results
Can off-the-shelf AI detectors distinguish between different levels of AI assistance in peer reviews?
What additional context does the peer review setting provide, and is that useful for AI detection?
What is the impact of "humanization" (adversarial paraphrasing)?
Conclusion
Off-the-shelf Detectors
Descriptions
Pangram and GPTZero: Easy subset results
...and 16 more sections

Figures (5)

Figure 1: Levels of AI assistance.
Figure 2: Confusion matrices showing the percentage of reviews classified as "AI", "Mixed", or "Human" by Pangram and GPTZero on the hard subset.
Figure 3: LLMs inadvertently introduce new content rather than merely "polishing": including the paper manuscript, omitting an explicit content-preservation instruction, and specifying generous word limits in the prompt lead LLMs to introduce new content in the "polished" review. Interestingly, such reviews are more likely to be flagged as AI-generated by Pangram.
Figure 4: Distribution of similarity (maximum over all AI-generated references) across levels of AI assistance. Many individual AI-polished reviews receive similarity scores indistinguishable from those of AI$^*$ reviews.
Figure 5: Confusion matrices showing the percentage of reviews classified as "AI", "Mixed", or "Human" by Pangram and GPTZero on the easy subset.
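
The confusion matrices in Figures 2 and 5 report, for each level of AI assistance, the percentage of reviews that a detector labels "AI", "Mixed", or "Human". A minimal sketch of that row-normalized tabulation is given here for illustration only. The function name `confusion_percentages`, the assistance-level names, and the toy data are all assumptions and do not come from the paper or from any detector's API.

```python
from collections import Counter

# Detector output labels named in the figure captions.
PREDICTED_LABELS = ("AI", "Mixed", "Human")


def confusion_percentages(true_levels, predicted):
    """Return {true_level: {predicted_label: percent}} with each row summing to 100."""
    if len(true_levels) != len(predicted):
        raise ValueError("true_levels and predicted must have the same length")

    counts = {}
    for level, label in zip(true_levels, predicted):
        if label not in PREDICTED_LABELS:
            raise ValueError(f"unexpected detector label: {label!r}")
        counts.setdefault(level, Counter())[label] += 1

    table = {}
    for level, row in counts.items():
        total = sum(row.values())
        # Normalize within each true level, so a cell reads as
        # "percent of reviews at this level given this detector label".
        table[level] = {label: 100.0 * row[label] / total for label in PREDICTED_LABELS}
    return table


# Toy data (hypothetical, not results from the paper).
true_levels = [
    "human-written", "human-written",
    "llm-polished", "llm-polished", "llm-polished", "llm-polished",
    "ai-generated", "ai-generated",
]
predicted = ["Human", "Human", "AI", "Mixed", "Human", "Human", "AI", "AI"]

table = confusion_percentages(true_levels, predicted)
for level, row in table.items():
    print(level, row)
```

In this layout, the "AI" cell of the LLM-polished row is the quantity the abstract is concerned with: the share of permitted, polished reviews that would be flagged as AI-generated.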
