Mapping the Increasing Use of LLMs in Scientific Papers

Weixin Liang; Yaohui Zhang; Zhengxuan Wu; Haley Lepp; Wenlong Ji; Xuandong Zhao; Hancheng Cao; Sheng Liu; Siyu He; Zhi Huang; Diyi Yang; Christopher Potts; Christopher D Manning; James Y. Zou

TL;DR

This paper addresses the question of how widely large language models (LLMs) influence scientific writing by estimating the prevalence of LLM-modified content at the population level across arXiv, bioRxiv, and Nature journals from 2020 to 2024. It introduces a distributional GPT quantification framework that operates on token-level distributions to infer the fraction of AI-altered sentences ($\alpha$) without labeling individual documents, and strengthens this with a two-stage, realistic LLM-data-generation pipeline and full-vocabulary estimation. An analysis of 950,965 papers reveals a steady rise in LLM-modified content, with the steepest growth in Computer Science (abstracts up to 17.5%), and comparatively lower increases in Mathematics and Nature venues; the study also identifies correlates such as higher first-author preprint activity, greater field crowding, and shorter papers. The findings have implications for publishing policy, the quality of LLM pretraining data (through sources like arXiv), and the need for transparent, scalable monitoring of AI-assisted scientific writing across disciplines.

Abstract

Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent these tools might have an effect on global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on arXiv, bioRxiv, and Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. Our statistical estimation operates on the corpus level and is more robust than inference on individual instances. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and papers of shorter lengths. Our findings suggest that LLMs are being broadly used in scientific writing.
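To make the population-level idea concrete, here is a minimal sketch of corpus-level estimation of $\alpha$. It assumes the corpus is modeled as a mixture $(1 - \alpha)P + \alpha Q$ of a human reference distribution $P$ and an LLM reference distribution $Q$, and that per-text log-likelihoods under both are already available. The function names (`estimate_alpha`, `demo`), the three-word vocabulary, and the grid search are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def estimate_alpha(log_p, log_q, grid_size=101):
    """Grid-search maximum-likelihood estimate of alpha.

    log_p[i] and log_q[i] are the log-likelihoods of text i under the
    human reference distribution P and the LLM reference distribution Q.
    The corpus is modeled as the mixture (1 - alpha) * P + alpha * Q.
    """
    def log_likelihood(alpha):
        total = 0.0
        for lp, lq in zip(log_p, log_q):
            m = max(lp, lq)  # factor out the larger term for numerical stability
            mix = (1 - alpha) * math.exp(lp - m) + alpha * math.exp(lq - m)
            if mix <= 0.0:
                return float("-inf")
            total += m + math.log(mix)
        return total

    candidates = [i / (grid_size - 1) for i in range(grid_size)]
    return max(candidates, key=log_likelihood)


def demo(true_alpha=0.15, n=10_000, seed=0):
    """Simulate a toy corpus of one-word 'texts' and recover alpha."""
    rng = random.Random(seed)
    vocab = ["the", "novel", "pivotal"]
    p = {"the": 0.70, "novel": 0.25, "pivotal": 0.05}  # toy human distribution (assumed)
    q = {"the": 0.50, "novel": 0.20, "pivotal": 0.30}  # toy LLM distribution (assumed)
    log_p, log_q = [], []
    for _ in range(n):
        source = q if rng.random() < true_alpha else p
        word = rng.choices(vocab, weights=[source[w] for w in vocab])[0]
        log_p.append(math.log(p[word]))
        log_q.append(math.log(q[word]))
    return estimate_alpha(log_p, log_q)


if __name__ == "__main__":
    print(f"estimated alpha: {demo():.2f}")
```

No single text is classified here: every text contributes to one shared likelihood, which is why the estimate can be more stable than per-document detection. The grid search is chosen only for transparency; any one-dimensional optimizer over $[0, 1]$ would serve.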

Paper Structure (31 sections, 15 figures)

Introduction
Related Work
GPT Detectors
Background: the distributional LLM quantification framework
Generating Realistic LLM-Produced Training Data
Using the Full Vocabulary for Estimation
Implementation and Validations
Data Collection and Sampling
Data Split, Model Fitting, and Evaluation
Main Results and Findings
Temporal Trends in AI-Modified Academic Writing
Setup
Results
Relationship Between First-Author Preprint Posting Frequency and GPT Usage
Relationship Between Paper Similarity and LLM Usage
...and 16 more sections

Figures (15)

Figure 1: Estimated Fraction of LLM-Modified Sentences across Academic Writing Venues over Time. This figure displays the fraction ($\alpha$) of sentences estimated to have been substantially modified by LLMs in abstracts from various academic writing venues. The analysis includes five areas within arXiv (Computer Science, Electrical Engineering and Systems Science, Mathematics, Physics, Statistics), articles from bioRxiv, and a combined dataset from 15 journals within the Nature portfolio. Estimates are based on the distributional GPT quantification framework, which provides population-level estimates rather than individual document analysis. Each point in time is independently estimated, with no temporal smoothing or continuity assumptions applied. Error bars indicate 95% confidence intervals by bootstrap. Further analysis of paper introductions is presented in Figure \ref{fig: temporal-introduction}.
Figure 2: Word Frequency Shift in arXiv Computer Science abstracts over 14 years (2010-2024). The plot shows the frequency over time for the top 4 words most disproportionately used by LLMs compared to humans, as measured by the log odds ratio. The words are: realm, intricate, showcasing, pivotal. These terms maintained a consistently low frequency in arXiv CS abstracts over more than a decade (2010--2022) but experienced a sudden surge in usage starting in 2023.
Figure 3: Fine-grained Validation of Model Performance Under Temporal Distribution Shift. We evaluate the accuracy of our models in estimating the fraction of LLM-modified content ($\alpha$) under a challenging temporal data split, where the validation data (sampled from 2022-01-01 to 2022-11-29) are temporally separated from the training data (collected up to 2020-12-31) by at least a year. The X-axis indicates the ground truth $\alpha$, while the Y-axis indicates the model's estimated $\alpha$. In all cases, the estimation error for $\alpha$ is less than 3.5%. The first 7 panels (a--g) are the validation on abstracts for each academic writing venue, while the later 6 panels (h--m) are the validation on introductions. We did not include bioRxiv introductions due to the unavailability of bulk PDF downloads. Error bars indicate 95% confidence intervals by bootstrap.
Figure 4: Papers authored by first authors who post preprints more frequently tend to have a higher fraction of LLM-modified content. Papers in arXiv Computer Science are stratified into two groups based on the preprint posting frequency of their first author, as measured by the number of first-authored preprints in the year. Error bars indicate 95% confidence intervals by bootstrap.
Figure 5: Papers in more crowded research areas tend to have a higher fraction of LLM-modified content. Papers in arXiv Computer Science are divided into two groups based on their abstract's embedding distance to their closest peer: papers more similar to their closest peer (below median distance) and papers less similar to their closest peer (above median distance). Error bars indicate 95% confidence intervals by bootstrap.
...and 10 more figures
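
The log odds ratio mentioned in the Figure 2 caption can be illustrated with a short sketch. It ranks words by how much more often they appear in LLM text than in human text. The add-one smoothing, the function name `log_odds_ratios`, and the two toy token lists are assumptions made for this example; the caption does not specify how the ratio was computed.

```python
import math
from collections import Counter


def log_odds_ratios(llm_tokens, human_tokens, smoothing=1.0):
    """Return {word: log odds ratio}, positive when a word is more typical of LLM text.

    Counts are smoothed (an assumption of this sketch) so that words seen in
    only one corpus still get a finite score. Needs at least two distinct words.
    """
    llm_counts = Counter(llm_tokens)
    human_counts = Counter(human_tokens)
    vocab = set(llm_counts) | set(human_counts)
    llm_total = len(llm_tokens) + smoothing * len(vocab)
    human_total = len(human_tokens) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_llm = (llm_counts[word] + smoothing) / llm_total
        p_human = (human_counts[word] + smoothing) / human_total
        scores[word] = (math.log(p_llm / (1 - p_llm))
                        - math.log(p_human / (1 - p_human)))
    return scores


# Toy corpora, invented for illustration only.
llm_tokens = "the pivotal realm of the intricate model".split()
human_tokens = "the model of the data and the model".split()

scores = log_odds_ratios(llm_tokens, human_tokens)
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for word in ranking[:3]:
        print(f"{word}: {scores[word]:+.2f}")
```

On real data, the two token lists would come from large samples of LLM-generated and human-written abstracts, and the highest-scoring words are the candidates to track over time.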
