Can AI writing be salvaged? Mitigating Idiosyncrasies and Improving Human-AI Alignment in the Writing Process through Edits

Tuhin Chakrabarty, Philippe Laban, Chien-Sheng Wu

TL;DR

This work investigates whether AI writing can be salvaged by identifying idiosyncrasies in LLM outputs and mitigating them via expert edits. It introduces a seven-category edit taxonomy derived from professional writers, builds the LAMP corpus of 1,057 LLM-generated paragraphs with 8,035 expert edits, and develops automatic detection and rewriting pipelines. A large-scale preference study shows that writer-edited text is preferred over LLM-edited and LLM-generated text, though automatic editing methods can come close to human performance, indicating potential for scalable alignment. Overall, the research provides actionable design principles and a valuable dataset to improve human-AI co-writing while preserving linguistic diversity and expressiveness in creative writing.

Abstract

LLM-based applications are helping people write, and LLM-generated text is making its way into social media, journalism, and our classrooms. However, the differences between LLM-generated and human-written text remain unclear. To explore this, we hired professional writers to edit paragraphs in several creative domains. First, we found that these writers agree on undesirable idiosyncrasies in LLM-generated text, which we formalized into a seven-category taxonomy (e.g., clichés, unnecessary exposition). Second, we curated the LAMP corpus: 1,057 LLM-generated paragraphs edited by professional writers according to our taxonomy. Analysis of LAMP reveals that none of the LLMs used in our study (GPT4o, Claude-3.5-Sonnet, Llama-3.1-70b) outperforms the others in terms of writing quality, revealing common limitations across model families. Third, building on existing work in automatic editing, we evaluated methods to improve LLM-generated text. A large-scale preference annotation confirms that although experts largely prefer text edited by other experts, automatic editing methods show promise in improving alignment between LLM-generated and human-written text.

Paper Structure (44 sections, 10 figures, 19 tables)

Figures (10)

  • Figure 1: To align models to human preferences, human annotators are typically shown two responses and asked to choose the one they prefer. (i) The top portion of the Figure shows Traditional Alignment: it is often hard to compare two responses that differ widely. (ii) The bottom portion of the Figure shows Alignment via Edits where the original response is edited, allowing for a more granular comparison, with the edited version of the text naturally preferred over the original response.
  • Figure 2: The pipeline for data creation. Step 1) Extract context-independent paragraphs from our respective sources. Step 2) Use an LLM to automatically generate instructions for the corresponding human-written text. Step 3) Use the generated instructions, grounded in real-world writing, to elicit responses from LLMs and create <instructions, response> pairs.
  • Figure 3: Interface to collect edits from writers on LLM-generated text
  • Figure 4: Analysis of 1,057 paragraphs edited by 18 writer participants: (i) the edit operations they perform (insertions, deletions, etc.), (ii) the writing quality scores they assign, (iii) a comparison of writing quality scores across LLMs, and (iv) the relationship between IWQS and editing amount.
  • Figure 5: (i) The categories of edits they implement, (ii) the relationship between writing quality scores and error categories, and (iii) the distribution of semantic similarity scores for the edits in the dataset.
  • ...and 5 more figures