Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews

Weixin Liang, Zachary Izzo, Yaohui Zhang, Haley Lepp, Hancheng Cao, Xuandong Zhao, Lingjiao Chen, Haotian Ye, Sheng Liu, Zhi Huang, Daniel A. McFarland, James Y. Zou

TL;DR

This paper develops a population-level framework to quantify AI-modified content in large text corpora by estimating the AI-generated fraction α using a mixture model between human-written (P) and AI-generated (Q) text. It builds P and Q from reference human and AI reviews and employs maximum likelihood estimation to infer α in target corpora, avoiding per-document detection. Applied to peer reviews from major ML conferences after the release of ChatGPT and to Nature journals, the method finds notable AI-modified content in ML conferences but not in Nature, and reveals correlations with deadlines, citations, and reviewer engagement, as well as corpus-level homogenization trends. The work demonstrates strong empirical validity and computational efficiency, and argues for interdisciplinary exploration of how AI-generated text reshapes scientific information practices and peer review. It also discusses limitations and future directions for broader applicability and policy development.

Abstract

We present an approach for estimating the fraction of text in a large corpus which is likely to be substantially modified or produced by a large language model (LLM). Our maximum likelihood model leverages expert-written and AI-generated reference texts to accurately and efficiently examine real-world LLM-use at the corpus level. We apply this approach to a case study of scientific peer review in AI conferences that took place after the release of ChatGPT: ICLR 2024, NeurIPS 2023, CoRL 2023 and EMNLP 2023. Our results suggest that between 6.5% and 16.9% of text submitted as peer reviews to these conferences could have been substantially modified by LLMs, i.e. beyond spell-checking or minor writing updates. The circumstances in which generated text occurs offer insight into user behavior: the estimated fraction of LLM-generated text is higher in reviews which report lower confidence, were submitted close to the deadline, and from reviewers who are less likely to respond to author rebuttals. We also observe corpus-level trends in generated text which may be too subtle to detect at the individual level, and discuss the implications of such trends on peer review. We call for future interdisciplinary work to examine how LLM use is changing our information and knowledge practices.
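
The corpus-level estimate described above can be illustrated with a minimal sketch. It assumes that the human distribution P and the AI distribution Q are already known and that each document is reduced to a single toy outcome; the paper instead estimates P and Q from reference reviews. The function names, the three-outcome distributions, and the sample size are all hypothetical choices made for illustration, and the ternary search is just one simple way to maximize a concave one-dimensional objective, not necessarily the optimizer the authors use.

```python
import math
import random


def mixture_log_likelihood(alpha, p_vals, q_vals):
    """Log-likelihood of the sample under the mixture (1 - alpha) * P + alpha * Q."""
    return sum(math.log((1.0 - alpha) * p + alpha * q)
               for p, q in zip(p_vals, q_vals))


def estimate_alpha(p_vals, q_vals, tol=1e-6):
    """Maximize the (concave) mixture log-likelihood over [0, 1] by ternary search.

    p_vals[i] and q_vals[i] are the likelihoods of document i under P and Q.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (mixture_log_likelihood(m1, p_vals, q_vals)
                < mixture_log_likelihood(m2, p_vals, q_vals)):
            lo = m1  # the maximizer lies to the right of m1
        else:
            hi = m2  # the maximizer lies to the left of m2
    return 0.5 * (lo + hi)


# Hypothetical toy distributions over three document "types" (assumption).
P = {"a": 0.7, "b": 0.2, "c": 0.1}  # stands in for human-written text
Q = {"a": 0.1, "b": 0.3, "c": 0.6}  # stands in for AI-generated text


def sample_mixture(n, alpha, rng):
    """Draw n documents: each comes from Q with probability alpha, else from P."""
    docs = []
    for _ in range(n):
        source = Q if rng.random() < alpha else P
        docs.append(rng.choices(list(source), weights=list(source.values()))[0])
    return docs


rng = random.Random(0)
docs = sample_mixture(5000, 0.15, rng)  # true alpha = 0.15 in this toy setup
alpha_hat = estimate_alpha([P[d] for d in docs], [Q[d] for d in docs])
print(f"estimated alpha = {alpha_hat:.3f}")
```

Note that the estimator never labels an individual document as human or AI; it only uses how likely each document is under P and under Q, which is what makes the approach a population-level one.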

Paper Structure (51 sections, 3 theorems, 17 equations, 31 figures, 33 tables)

Key Result

Theorem 9.1

Suppose that there exists a constant $\kappa>0$ such that $\frac{|P(x)-Q(x)|}{\max\{P^2(x),Q^2(x)\}} \geq \kappa$. Furthermore, suppose that we have $n$ papers drawn from the mixture and that $P$ and $Q$ are estimated perfectly. Then, with probability at least $1-\delta$, the estimate $\hat{\alpha}$ computed from the finite sample is not far from the true fraction $\alpha$.
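
For orientation, the estimator that the theorem refers to can be written out under the mixture model from the summary. The sample notation $x_1,\dots,x_n$ and the symbol $\mathcal{L}$ are assumptions introduced here, and the remarks below are one reading of the setup rather than the paper's proof:

$$
\hat{\alpha} = \arg\max_{\alpha\in[0,1]} \mathcal{L}(\alpha), \qquad \mathcal{L}(\alpha) = \sum_{i=1}^{n} \log\big((1-\alpha)P(x_i) + \alpha Q(x_i)\big).
$$

Differentiating twice gives

$$
\mathcal{L}'(\alpha) = \sum_{i=1}^{n} \frac{Q(x_i)-P(x_i)}{(1-\alpha)P(x_i)+\alpha Q(x_i)}, \qquad \mathcal{L}''(\alpha) = -\sum_{i=1}^{n} \frac{\big(Q(x_i)-P(x_i)\big)^2}{\big((1-\alpha)P(x_i)+\alpha Q(x_i)\big)^2} \leq 0,
$$

so the objective is concave in $\alpha$. A term of $\mathcal{L}''$ vanishes only where $P(x_i)=Q(x_i)$, which the separation condition with $\kappa>0$ rules out; the objective is therefore strictly concave and $\hat{\alpha}$ is unique.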

Figures (31)

  • Figure 1: Shift in Adjective Frequency in ICLR 2024 Peer Reviews. We find a significant shift in the frequency of certain tokens in ICLR 2024, with adjectives such as “commendable”, “meticulous”, and “intricate” showing 9.8, 34.7, and 11.2-fold increases in probability of occurring in a sentence. We find a similar trend in NeurIPS but not in Nature Portfolio journals. Supp. Table \ref{table:word_adj_list} and Supp. Figure \ref{fig:word-cloud-adj} in the Appendix provide a visualization of the top 100 adjectives produced disproportionately by AI.
  • Figure 2: An overview of the method. We begin by generating a corpus of documents with known scientist or AI authorship. Using this historical data, we can estimate the scientist-written and AI text distributions $P$ and $Q$ and validate our method's performance on held-out data. Finally, we can use the estimated $P$ and $Q$ to estimate the fraction of AI-generated text in a target corpus.
  • Figure 3: Performance validation of our MLE estimator across ICLR '23, NeurIPS '22, and CoRL '22 reviews (all predating ChatGPT's launch) via the method described in Section \ref{sec: val}. Our algorithm demonstrates high accuracy, with less than 2.4% prediction error in identifying the proportion of LLM-generated feedback within the validation set. See Supp. Tables \ref{tab:verification-adj-main} and \ref{tab:verification-adj-main-nature} for full results.
  • Figure 4: Temporal changes in the estimated $\alpha$ for several ML conferences and Nature Portfolio journals. The estimated $\alpha$ for all ML conferences increases sharply after the release of ChatGPT (denoted by the dotted vertical line), indicating that LLMs are being used in a small but significant way. Conversely, the $\alpha$ estimates for Nature Portfolio reviews do not exhibit a significant increase or rise above the margin of error in our validation experiments for $\alpha=0$. See Supp. Tables \ref{tab:main-result} and \ref{tab: Nature trend} for full results.
  • Figure 5: Robustness of the estimations to proofreading. Evaluating $\alpha$ after using LLMs for "proofreading" (non-substantial editing) of peer reviews shows a minor, non-significant increase across conferences, confirming our method's sensitivity to text which was generated in significant part by LLMs, beyond simple proofreading. See Supp. Table \ref{app: proofread} for full results.
  • ...and 26 more figures
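
The fold increases quoted for Figure 1 compare how often a word occurs per sentence in two corpora. A minimal sketch of that arithmetic follows, assuming a naive regex sentence splitter and two tiny made-up corpora; the function names and example texts are hypothetical, and the paper's own tokenization may differ.

```python
import re


def sentence_occurrence_rate(texts, word):
    """Fraction of sentences that contain `word` at least once (case-insensitive)."""
    sentences = []
    for text in texts:
        # Naive splitter: break after ., ! or ? followed by whitespace (assumption).
        sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s)
    if not sentences:
        return 0.0
    hits = sum(1 for s in sentences if word in re.findall(r"[a-z']+", s.lower()))
    return hits / len(sentences)


def fold_change(before_texts, after_texts, word):
    """Ratio of the per-sentence occurrence rate after vs. before."""
    before = sentence_occurrence_rate(before_texts, word)
    after = sentence_occurrence_rate(after_texts, word)
    if before == 0.0:
        return float("inf") if after > 0 else float("nan")
    return after / before


# Made-up toy corpora, for illustration only.
reviews_before = ["The method is sound. The proofs are commendable. "
                  "Results are fine. Writing is clear."]
reviews_after = ["The work is commendable. A commendable effort overall."]

print(fold_change(reviews_before, reviews_after, "commendable"))
```

Here the word appears in 1 of 4 sentences before and 2 of 2 sentences after, so the ratio is 1.0 / 0.25 = 4.0.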

Theorems & Definitions (6)

  • Theorem 9.1
  • proof
  • Lemma 9.2
  • proof
  • Lemma 9.3
  • proof