How LLMs Distort Our Written Language

Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques

Abstract

Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to make only grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of scientific peer reviews at a recent top AI conference that were AI-generated. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

Paper Structure (70 sections, 40 figures, 1 table)

Figures (40)

  • Figure 1: LLM-generated revisions display a larger and more consistent semantic shift than human-written revisions of the same essays. Each point pair represents an essay from the ArgRewrite-v2 dataset before (D1) and after (D2) revision, embedded using MiniLM-L6-v2 sentence embeddings (Reimers and Gurevych, 2019) and projected into two dimensions via PCA, a common approach for analyzing semantic differences (Dhillon et al., 2015). The top-left panel (grey) shows human revisions, while the remaining panels show revisions produced by prompting the LLM with human-written expert feedback and different edit instructions (see titles). Even the instruction to make minimal edits shows large shifts (top-right). Arrows indicate direction and magnitude of semantic change. Human revisions exhibit smaller and more diverse shifts, whereas LLM revisions produce large semantic shifts that are strongly aligned in a common direction and move toward a region of space not previously occupied by any human-written essay.
  • Figure 2: Example of substantial argumentative rewriting by an LLM on essays from ArgRewrite-v2. Red text highlights segments removed or substantially reframed by the model, while green text highlights segments added to the human draft. The figure illustrates how LLM edits frequently alter the person's intended conclusions, removing content that makes a particular claim, and editing the essay to be more neutral or positive about the technology of self-driving cars. Further, LLM edits will often remove human colloquialisms, anecdotes, or examples, leading to repetitive writing that loses the person's voice.
  • Figure 3: Unique words in human-edited texts (left) versus AI-edited texts produced using gpt-5-mini (right) for the ArgRewrite-v2 analysis. Word size reflects relative frequency, highlighting stylistic and thematic differences between how humans and AI write about self-driving cars.
  • Figure 4: In a randomized controlled trial, users who engaged in heavy LLM use report that their essays are less creative and not in their voice, and their writing exhibits large, homogenizing semantic shifts.
  • Figure 5: Semantic shifts induced by human and LLM revisions for the ArgRewrite-v2 dataset. Each point pair represents an essay before (D1) and after (D2) revision, embedded using gemini-004 sentence embeddings and projected into two dimensions via PCA, a common approach for analyzing semantic differences (Dhillon et al., 2015). The left panel shows human revisions, while the remaining panels show revisions produced by different LLMs without access to expert feedback. Arrows indicate the direction and magnitude of semantic change. Human revisions exhibit smaller, more varied semantic shifts, whereas LLM revisions produce larger shifts that are strongly aligned in a common direction, indicating a homogenization effect in semantic space.
  • ...and 35 more figures
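
The analysis described in the captions of Figures 1 and 5 (embed each draft and its revision, project to two dimensions with PCA, then compare the before-to-after arrows) can be sketched roughly as follows. This is a minimal illustration, not the authors' code. The embeddings are random placeholder vectors rather than real essay embeddings, with 384 dimensions chosen to match MiniLM-L6-v2's output size. The helper names `pca_2d` and `shift_stats`, and the alignment measure (length of the mean unit shift vector), are assumptions introduced here for illustration.

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

def shift_stats(before, after):
    """Return (mean shift length, directional alignment) for point pairs."""
    shifts = after - before
    norms = np.linalg.norm(shifts, axis=1)
    unit = shifts / np.clip(norms, 1e-12, None)[:, None]
    # Length of the mean unit vector: close to 1 when all arrows point
    # the same way, close to 0 when their directions scatter.
    alignment = np.linalg.norm(unit.mean(axis=0))
    return norms.mean(), alignment

# Placeholder data only: random vectors standing in for essay embeddings.
rng = np.random.default_rng(0)
n, d = 50, 384
drafts = rng.normal(size=(n, d))
# Stand-in for human revisions: small shifts in unrelated directions.
human_rev = drafts + 0.1 * rng.normal(size=(n, d))
# Stand-in for LLM revisions: one large shared offset plus small noise.
common = rng.normal(size=d)
llm_rev = drafts + common + 0.1 * rng.normal(size=(n, d))

# Fit a single PCA on all points so every panel shares the same axes.
coords = pca_2d(np.vstack([drafts, human_rev, llm_rev]))
d1, human2d, llm2d = coords[:n], coords[n:2 * n], coords[2 * n:]

human_len, human_align = shift_stats(d1, human2d)
llm_len, llm_align = shift_stats(d1, llm2d)
print(f"human: mean shift {human_len:.2f}, alignment {human_align:.2f}")
print(f"LLM:   mean shift {llm_len:.2f}, alignment {llm_align:.2f}")
```

With real data, `drafts`, `human_rev`, and `llm_rev` would be replaced by sentence embeddings of the D1 and D2 essays. Fitting one projection on the pooled points is a design choice made in this sketch so that arrows from different panels are directly comparable.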