Mapping the Increasing Use of LLMs in Scientific Papers
Weixin Liang, Yaohui Zhang, Zhengxuan Wu, Haley Lepp, Wenlong Ji, Xuandong Zhao, Hancheng Cao, Sheng Liu, Siyu He, Zhi Huang, Diyi Yang, Christopher Potts, Christopher D Manning, James Y. Zou
TL;DR
This paper addresses the question of how widely large language models (LLMs) influence scientific writing by estimating the prevalence of LLM-modified content at the population level across arXiv, bioRxiv, and Nature journals from 2020 to 2024. It introduces a distributional GPT quantification framework that operates on token-level distributions to infer the fraction of AI-altered sentences (\alpha) without labeling individual documents, and strengthens this with a two-stage, realistic LLM-data-generation pipeline and full-vocabulary estimation. An analysis of 950,965 papers reveals a steady rise in LLM-modified content, with the steepest growth in Computer Science (abstracts up to 17.5%), and comparatively lower increases in Mathematics and Nature venues; the study also identifies correlates such as higher first-author preprint activity, greater field crowding, and shorter papers. The findings have implications for publishing policy, the quality of LLM pretraining data (through sources like arXiv), and the need for transparent, scalable monitoring of AI-assisted scientific writing across disciplines.
Abstract
Scientific publishing lays the foundation of science by disseminating research findings, fostering collaboration, encouraging reproducibility, and ensuring that scientific knowledge is accessible, verifiable, and built upon over time. Recently, there has been immense speculation about how many people are using large language models (LLMs) like ChatGPT in their academic writing, and to what extent these tools might affect global scientific practices. However, we lack a precise measure of the proportion of academic writing substantially modified or produced by LLMs. To address this gap, we conduct the first systematic, large-scale analysis across 950,965 papers published between January 2020 and February 2024 on arXiv, bioRxiv, and Nature portfolio journals, using a population-level statistical framework to measure the prevalence of LLM-modified content over time. Our statistical estimation operates at the corpus level and is more robust than inference on individual instances. Our findings reveal a steady increase in LLM usage, with the largest and fastest growth observed in Computer Science papers (up to 17.5%). In comparison, Mathematics papers and the Nature portfolio showed the least LLM modification (up to 6.3%). Moreover, at an aggregate level, our analysis reveals that higher levels of LLM modification are associated with papers whose first authors post preprints more frequently, papers in more crowded research areas, and shorter papers. Our findings suggest that LLMs are being broadly used in scientific writing.
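
To illustrate what corpus-level estimation can look like, the sketch below fits a single mixing fraction \alpha by maximum likelihood. It assumes the corpus is modeled as a mixture (1 - \alpha) P + \alpha Q, where P and Q are reference distributions for human-written and LLM-modified text. The toy vocabulary, its probabilities, and the function names `estimate_alpha` and `simulate_corpus` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random


def estimate_alpha(log_p, log_q, tol=1e-4):
    """Maximum-likelihood estimate of alpha in the mixture (1 - alpha) * P + alpha * Q.

    log_p[i] and log_q[i] are the log-probabilities of item i under the
    human-written (P) and LLM-modified (Q) reference distributions.
    """
    def log_lik(alpha):
        total = 0.0
        for lp, lq in zip(log_p, log_q):
            m = max(lp, lq)  # factor out the larger term for numerical stability
            total += m + math.log(
                (1.0 - alpha) * math.exp(lp - m) + alpha * math.exp(lq - m)
            )
        return total

    # The mixture log-likelihood is concave in alpha, so ternary search
    # over the open interval (0, 1) converges to the maximizer.
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_lik(m1) < log_lik(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


def simulate_corpus(n, alpha, p, q, rng):
    """Draw n tokens; each comes from Q with probability alpha, otherwise from P."""
    vocab = list(p)
    tokens = []
    for _ in range(n):
        dist = q if rng.random() < alpha else p
        tokens.append(rng.choices(vocab, weights=[dist[w] for w in vocab])[0])
    return tokens


if __name__ == "__main__":
    # Hypothetical reference distributions over a four-word toy vocabulary.
    P = {"the": 0.50, "results": 0.30, "show": 0.19, "delve": 0.01}
    Q = {"the": 0.40, "results": 0.25, "show": 0.15, "delve": 0.20}
    rng = random.Random(0)
    corpus = simulate_corpus(5000, 0.175, P, Q, rng)
    alpha_hat = estimate_alpha(
        [math.log(P[t]) for t in corpus],
        [math.log(Q[t]) for t in corpus],
    )
    print(f"estimated alpha: {alpha_hat:.3f}")
```

No individual item is ever classified as human-written or LLM-modified in this sketch; only the aggregate fraction is estimated, which is the sense in which a corpus-level estimate can be more robust than inference on individual instances.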
