The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models

Zhivar Sourati, Farzan Karimi-Malekabadi, Meltem Ozcan, Colin McDaniel, Alireza Ziabari, Jackson Trager, Ala Tak, Meng Chen, Fred Morstatter, Morteza Dehghani

TL;DR

The paper investigates how the use of large language models (LLMs) as writing assistants is linked to declining linguistic diversity: writing styles become more homogeneous, and the signals that tie language to individual and social traits become distorted. Across four studies that combine time-series shock analysis, Granger-causality tests, and controlled rewrites spanning demographic, personality, empathy, and moral domains, the authors show that LLM-assisted writing reduces variance in writing style while largely preserving semantics. They also find that lexical cues predictive of personal attributes become weaker or systematically biased after LLM rewrites, with implications for diagnostics, hiring, and cultural preservation. The work argues for monitoring and mitigating homogenization as AI-assisted communication becomes more pervasive, and it provides publicly available data and code for replication.
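To make the "reduced variance in writing style" claim concrete, the following is a minimal sketch of how such a drop could be measured. It is an illustration only, not the authors' pipeline: the complexity proxy (mean sentence length), the helper names `mean_sentence_length` and `style_variance`, and the toy sentences are all assumptions introduced here, and the paper's actual complexity measures are not specified in this summary.

```python
import re
import statistics


def mean_sentence_length(text: str) -> float:
    """Average number of whitespace-separated words per sentence.

    Assumption: a crude stand-in for 'writing complexity'; the paper's
    own measure may differ.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return statistics.fmean(len(s.split()) for s in sentences)


def style_variance(texts: list[str]) -> float:
    """Population variance of the complexity proxy across a set of texts."""
    return statistics.pvariance([mean_sentence_length(t) for t in texts])


# Invented toy examples for illustration; NOT data from the paper.
originals = [
    "Went out. Rained. Came back soaked.",
    "I suppose that, all things considered, the trip was rather more "
    "tiring than any of us had expected.",
    "The dog barked at the mail carrier again today.",
]
# Hypothetical 'polished' versions of the same three texts.
rewrites = [
    "I went outside, but it rained. I came back soaked.",
    "All things considered, the trip was more tiring than expected.",
    "The dog barked at the mail carrier again today.",
]

var_before = style_variance(originals)
var_after = style_variance(rewrites)
print(f"variance before: {var_before:.2f}, after: {var_after:.2f}")
```

In this toy setup the rewrites pull the very short and very long texts toward a common middle, so the variance of the proxy falls even though each text keeps its content. A real analysis would need a validated complexity measure and a separate semantic-similarity check.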

Abstract

Language is far more than a communication tool. A wealth of information - including but not limited to the identities, psychological states, and social contexts of its users - can be gleaned through linguistic markers, and such insights are routinely leveraged across diverse fields ranging from product development and marketing to healthcare. In four studies utilizing experimental and observational methods, we demonstrate that the widespread adoption of large language models (LLMs) as writing assistants is linked to notable declines in linguistic diversity and may interfere with the societal and psychological insights language provides. We show that while the core content of texts is retained when LLMs polish and rewrite texts, not only do they homogenize writing styles, but they also alter stylistic elements in a way that selectively amplifies certain dominant characteristics or biases while suppressing others - emphasizing conformity over individuality. By varying LLMs, prompts, classifiers, and contexts, we show that these trends are robust and consistent. Our findings highlight a wide array of risks associated with linguistic homogenization, including compromised diagnostic processes and personalization efforts, the exacerbation of existing divides and barriers to equity in settings like personnel selection where language plays a critical role in assessing candidates' qualifications, communication skills, and cultural fit, and the undermining of efforts for cultural preservation.

Paper Structure

This paper contains 20 sections, 1 equation, 23 figures, 19 tables.

Figures (23)

  • Figure 1: Trends in the variance of writing complexity and the attribution rate of texts as AI-generated.
  • Figure 2: Semantic similarity between original and LLM-generated texts (with Rephrase prompt on GPT3.5) across different data sources.
  • Figure 3: Semantic similarity between original and LLM-generated texts (with Rephrase prompt on GPT3.5) across Essays, YourMorals, Congress, and Empathetic Conversations datasets.
  • Figure 4: The distribution of $\Delta$ (a proxy for imbalances between the predicted class frequencies on the original and LLM-rewritten texts) across different personal traits, focusing on the LLM-rewritten texts generated by GPT3.5. The $W$ statistics from the Wilcoxon test are displayed on top, with $p < .05$, $p < .01$, and $p < .001$, marked with *, **, ***, respectively.
  • Figure 5: Percentage of original texts with correct author attribute predictions that changed after LLM rewriting, grouped by the direction of change in predictions.
  • ...and 18 more figures