Human-LLM Coevolution: Evidence from Academic Writing

Mingmeng Geng, Roberto Trotta

TL;DR

The paper investigates how LLM usage reshapes academic writing by analyzing word frequencies in arXiv abstracts from 2018 to 2024. It computes monthly word frequencies normalized per 10,000 abstracts and introduces the ratio $R_{ij}(T_1,T_2)$ to compare periods, using data from arXiv metadata and withdrawn papers. The key finding is that words flagged as overused by LLMs, such as "delve" and "intricate", decline in frequency after April 2024, while some LLM-favored words such as "significant" continue to rise. Results from machine-generated-text detectors (e.g., Binoculars) are sensitive to the prompt, which makes them less reliable in real-world scenarios. The study identifies human-LLM coevolution as a major source of detection difficulty and recommends population-level word-frequency analysis as a robust, long-term indicator of LLM impact on scholarly writing.
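To make the normalization concrete, here is a minimal sketch of a monthly word frequency per 10,000 abstracts, plus a trailing 12-month average of the kind the figure captions mention for withdrawn papers. It is not the authors' code. The function names, the `(month, text)` input format, and the case-insensitive whole-word matching (which ignores inflections such as "delves") are all assumptions made for illustration. The ratio $R_{ij}(T_1,T_2)$ is not reproduced because its definition is not given in this summary.

```python
import re
from collections import defaultdict


def monthly_frequency(abstracts, word):
    """Occurrences of `word` per 10,000 abstracts, grouped by month.

    `abstracts` is an iterable of (month, text) pairs, e.g. ("2024-04", "...").
    Matching is case-insensitive and whole-word only (an assumption).
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = defaultdict(int)    # occurrences of the word in each month
    totals = defaultdict(int)  # number of abstracts in each month
    for month, text in abstracts:
        totals[month] += 1
        hits[month] += len(pattern.findall(text))
    return {m: 10_000 * hits[m] / totals[m] for m in sorted(totals)}


def rolling_mean(series, window=12):
    """Trailing mean over up to `window` months of a {month: value} dict."""
    months = sorted(series)
    smoothed = {}
    for i, month in enumerate(months):
        chunk = months[max(0, i - window + 1): i + 1]
        smoothed[month] = sum(series[k] for k in chunk) / len(chunk)
    return smoothed


# Toy data, invented purely to exercise the functions.
sample = [
    ("2024-03", "We delve into X and delve again."),
    ("2024-03", "A significant result."),
    ("2024-04", "No marker words here."),
]
freq = monthly_frequency(sample, "delve")
smoothed = rolling_mean(freq, window=2)
```

On real data the input would be the full set of abstracts for each month, and the month keys must sort chronologically (as `YYYY-MM` strings do) for the rolling window to be meaningful.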

Abstract

With a statistical analysis of arXiv paper abstracts, we report a marked drop in the frequency of several words previously identified as overused by ChatGPT, such as "delve", starting soon after they were pointed out in early 2024. The frequency of certain other words favored by ChatGPT, such as "significant", has instead kept increasing. These phenomena suggest that some authors of academic papers have adapted their use of large language models (LLMs), for example, by selecting outputs or applying modifications to the LLM-generated content. Such coevolution and cooperation of humans and LLMs thus introduce additional challenges to the detection of machine-generated text in real-world scenarios. Estimating the impact of LLMs on academic writing by examining word frequency remains feasible, and more attention should be paid to words that were already frequently employed, including those that have decreased in frequency due to LLMs' disfavor.

Paper Structure

This paper contains 8 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: The frequency evolution of some words in arXiv abstracts (they were singled out around April 2024 as either favored or disfavored by ChatGPT).
  • Figure 2: Frequency of words in arXiv abstracts previously identified as indicative of LLM usage. All word frequencies are normalized per 10,000 abstracts. Word groups a and b correspond to the average frequencies of the words in \ref{wf_realm} and \ref{wf_commenable}. The data for withdrawn papers are a 12-month rolling average, labeled "w".
  • Figure 3: Comparing the ratio of word frequency between Computer Science abstracts and other disciplines. Only words that appear at least 20 times on average per 10,000 abstracts are plotted.
  • Figure 4: Comparison of word frequencies before and after LLM processing (with prompts P1 or P2).
  • Figure 5: Machine-generated text (MGT) detection results for real and LLM-processed abstracts (with prompts P1 or P2). A lower score indicates a greater probability that the text is machine-generated.
  • ...and 2 more figures