Human-LLM Coevolution: Evidence from Academic Writing
Mingmeng Geng, Roberto Trotta
TL;DR
The paper investigates how LLM usage reshapes academic writing by analyzing word frequencies in arXiv abstracts from 2018–2024. It computes monthly word frequencies normalized per 10,000 abstracts and introduces the ratio $R_{ij}(T_1,T_2)$ to compare periods $T_1$ and $T_2$, drawing on arXiv metadata and withdrawn papers. Key findings show a post-April 2024 decline in words previously flagged as overused by LLMs, such as $\textit{delve}$ and $\textit{intricate}$, while other LLM-favored words such as $\textit{significant}$ continue to rise. Detection results (e.g., from Binoculars) are sensitive to prompts and less reliable for real-time classification. The study identifies human–LLM coevolution as a major source of detection difficulty and recommends population-level word-frequency analysis as a robust, long-term indicator of LLM impact on scholarly writing.
Abstract
With a statistical analysis of arXiv paper abstracts, we report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as "delve", starting soon after they were pointed out in early 2024. The frequency of certain other words favored by ChatGPT, such as "significant", has instead kept increasing. These phenomena suggest that some authors of academic papers have adapted their use of large language models (LLMs), for example, by selecting outputs or applying modifications to the LLM-generated content. Such coevolution and cooperation of humans and LLMs thus introduce additional challenges to the detection of machine-generated text in real-world scenarios. Estimating the impact of LLMs on academic writing by examining word frequency remains feasible, and more attention should be paid to words that were already frequently employed, including those that have decreased in frequency due to LLMs' disfavor.
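To make the word-frequency approach concrete, the following is a minimal Python sketch of a monthly frequency normalized per 10,000 abstracts, as mentioned in the TL;DR. Everything here is an assumption made for illustration: the function names `monthly_frequency` and `period_ratio`, the `(month, abstract)` input format, counting every occurrence of a word rather than the number of abstracts containing it, and exact-token matching. In particular, `period_ratio` is a plain ratio of mean frequencies between two periods and is not the paper's $R_{ij}(T_1,T_2)$, whose definition is not reproduced in this summary.

```python
import re
from collections import Counter, defaultdict


def monthly_frequency(records, words, per=10_000):
    """Return {month: {word: occurrences per `per` abstracts}}.

    `records` is an iterable of (month, abstract) pairs, e.g. ("2024-03", "...").
    Assumption: every occurrence is counted, and matching is on exact
    lowercase tokens (so "delves" does not count toward "delve").
    """
    targets = {w.lower() for w in words}
    counts = defaultdict(Counter)  # month -> word -> number of occurrences
    n_abstracts = Counter()        # month -> number of abstracts seen
    for month, abstract in records:
        n_abstracts[month] += 1
        for token in re.findall(r"[a-z]+", abstract.lower()):
            if token in targets:
                counts[month][token] += 1
    return {
        month: {w: per * counts[month][w] / n for w in sorted(targets)}
        for month, n in n_abstracts.items()
    }


def period_ratio(freqs, word, months_1, months_2):
    """Illustrative ratio of mean frequency in period 2 to period 1.

    Hypothetical helper, not the paper's R_ij(T1, T2).
    """
    mean_1 = sum(freqs[m][word] for m in months_1) / len(months_1)
    mean_2 = sum(freqs[m][word] for m in months_2) / len(months_2)
    return mean_2 / mean_1 if mean_1 else float("nan")


if __name__ == "__main__":
    # Toy records, invented purely to exercise the functions.
    records = [
        ("2024-03", "We delve into intricate models. We delve deeper."),
        ("2024-03", "A significant result."),
        ("2024-04", "Significant gains; significant speedups."),
    ]
    freqs = monthly_frequency(records, ["delve", "significant"])
    for month in sorted(freqs):
        print(month, freqs[month])
    print(period_ratio(freqs, "significant", ["2024-03"], ["2024-04"]))
```

With real data, `records` would come from arXiv metadata, with each abstract paired with its submission month. A ratio well below 1 for a word across two periods would indicate a population-level drop in its use, and a ratio above 1 a rise.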
