GPT Editors, Not Authors: The Stylistic Footprint of LLMs in Academic Preprints

Soren DeHaan, Yuanze Liu, Johan Bollen, Saúl A. Blanco

TL;DR

The paper investigates whether LLMs appear in academic preprints and, if so, whether they are used for editing and translation or for generating text outright. It introduces a hybrid method that combines a naive Bayes classifier over word frequencies, scored by per-word log-odds $\mathrm{LogOdds}(W)$, with Pruned Exact Linear Time (PELT) changepoint detection to quantify stylistic segmentation along a manuscript. Applying the approach to arXiv data and text regenerated by GPT-3.5 Turbo, the study finds that LLM usage is largely uniform within a paper and predominantly editorial, with partial generation being uncommon; once results are normalized by document length, the apparent link between LLM signals and segmentation disappears. The findings have policy implications: they support responsible disclosure and the use of LLMs as editing tools in scientific writing, while maintaining vigilance against misuse.
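To make the log-odds scoring concrete, the following is a minimal sketch of a word-frequency classifier of the kind described above. It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing parameter `alpha`, the per-word averaging in `score`, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def log_odds_table(human_docs, llm_docs, alpha=1.0):
    """Per-word log-odds log(P(w | LLM) / P(w | human)) with additive smoothing."""
    human_counts = Counter(w for d in human_docs for w in tokenize(d))
    llm_counts = Counter(w for d in llm_docs for w in tokenize(d))
    vocab = set(human_counts) | set(llm_counts)
    human_total = sum(human_counts.values()) + alpha * len(vocab)
    llm_total = sum(llm_counts.values()) + alpha * len(vocab)
    return {
        w: math.log((llm_counts[w] + alpha) / llm_total)
        - math.log((human_counts[w] + alpha) / human_total)
        for w in vocab
    }


def score(text, table):
    """Mean log-odds over known words; positive values lean toward LLM-style text."""
    words = [w for w in tokenize(text) if w in table]
    if not words:
        return 0.0
    return sum(table[w] for w in words) / len(words)


# Toy corpora, invented for illustration only.
human_docs = ["we show the result", "we prove the bound"]
llm_docs = ["we delve into the intricate result", "we delve into the realm"]
table = log_odds_table(human_docs, llm_docs)
```

With equal class priors, summing these per-word values is the naive Bayes decision rule; averaging, as done here, simply keeps scores comparable across passages of different length.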

Abstract

The proliferation of Large Language Models (LLMs) in late 2022 has impacted academic writing, threatening credibility and causing institutional uncertainty. We seek to determine the degree to which LLMs are used to generate critical text as opposed to being used for editing, such as checking for grammar errors or inappropriate phrasing. In our study, we analyze arXiv papers for stylistic segmentation, which we measure by varying a PELT threshold against a Bayesian classifier trained on GPT-regenerated text. We find that LLM-attributed language is not predictive of stylistic segmentation, suggesting that when authors use LLMs, they do so uniformly, reducing the risk of hallucinations being introduced into academic preprints.
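The role of the PELT threshold can be illustrated with a small pure-Python sketch. Everything specific here is an assumption rather than the paper's setup: the input is taken to be a sequence of per-sentence classifier scores, the segment cost is the sum of squared deviations from the segment mean, and the function name `pelt` and the parameter `penalty` are hypothetical. The penalty plays the part of the threshold: the higher it is, the stronger a stylistic shift must be before a changepoint is reported.

```python
def pelt(signal, penalty):
    """Return changepoint indices minimizing segment cost plus penalty per changepoint."""
    n = len(signal)
    # Prefix sums allow each segment cost to be computed in O(1).
    s1, s2 = [0.0], [0.0]
    for x in signal:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        # Sum of squared deviations from the mean over signal[a:b].
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = [0.0] * (n + 1)  # best[t]: optimal penalized cost of signal[:t]
    best[0] = -penalty
    last = [0] * (n + 1)    # last[t]: start index of the final segment ending at t
    candidates = [0]
    for t in range(1, n + 1):
        values = [(best[s] + cost(s, t) + penalty, s) for s in candidates]
        best[t], last[t] = min(values)
        # Pruning step: drop starts that can no longer be optimal for any later t.
        candidates = [s for v, s in values if v - penalty <= best[t]]
        candidates.append(t)

    # Walk back through the stored segment starts to recover the changepoints.
    changepoints = []
    t = n
    while last[t] > 0:
        t = last[t]
        changepoints.append(t)
    return sorted(changepoints)


# Toy score sequence with one abrupt shift, invented for illustration.
scores = [0.0] * 10 + [5.0] * 10
low_threshold = pelt(scores, penalty=1.0)
high_threshold = pelt(scores, penalty=1000.0)
```

Sweeping `penalty` and recording where the changepoints vanish gives one number per document that summarizes how sharply its style is segmented, which is the kind of quantity the abstract describes measuring.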

Paper Structure

This paper contains 13 sections, 1 equation, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Histogram of classifier results on GPT-regenerated text.
  • Figure 2: Confusion matrix for the classifier.
  • Figure 3: The distribution difference of PELT thresholds for the segmented dataset.
  • Figure 4: The original data. Observe length as a confounding variable.
  • Figure 5: After normalization with Z-score.