Writing as a testbed for open ended agents

Sian Gooding, Lucia Lopez-Rivilla, Edward Grefenstette

TL;DR

The paper treats open-ended writing as a rigorous testbed for autonomous LLM-based agents, examining how action diversity, human alignment, and iterative refinement together determine document quality. It benchmarks three prominent LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o) across 22 documents and a curated set of 5,750 actions per model. The study shows that broad action spaces alone do not guarantee improvements: models excel at adding content but underperform on evaluative, subtractive, and goal-aligned refinements, and sometimes cause semantic drift over multiple revision steps. A mixed-methods evaluation pairing embedding-based diversity metrics with human judgments reveals model-specific biases and the limits of self-evaluation via prompting, underscoring the need for justification, context grounding, and robust filtering. The findings point to practical directions for building open-ended writing agents, including expanding strategies beyond additive edits, integrating rationale generation, and designing interfaces and training procedures that align autonomous revisions with writer intent and document goals. Overall, the work advances a framework for benchmarking open-ended agents in writing and highlights broad challenges and potential solutions relevant to open-ended AI systems beyond text editing.

Abstract

Open-ended tasks are particularly challenging for LLMs due to the vast solution space, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, with its vast solution space and subjective evaluation criteria, provides a compelling testbed for studying such problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs - Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o - focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.

Paper Structure

This paper contains 20 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Example excerpt from a document with randomly sampled actions from models on the right. The plot on the left visualises the embedding space of actions relative to the document $D$, where more tailored actions (e.g., 5,6) are closer to $D$, while more general actions (e.g., 1,3,4) are further away.
  • Figure 2: This violin plot shows a comparison of the models on action-document similarity, measured using cosine distance. The document and actions are embedded in the same space, and similarity is computed as the cosine distance between the document and each action. Higher similarity values indicate actions that closely align with the original document, while lower values suggest more general or diverse modifications.
  • Figure 3: Comparison of the frequency of the most common verbs used in actions suggested by Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT-4o, and human annotators for shared documents, highlighting commonalities and differences in their preferred improvement strategies.
  • Figure 4: Analysis of Direct Feedback Location: Percentage Distribution and Descriptive Statistics Across Models
  • Figure 5: Violin plot showing the distribution of positive percentages for actions suggested by Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. A value of 1 indicates full agreement among 10 annotators that the action was beneficial, while a value of 0 means none of the annotators rated the action positively.
  • ...and 4 more figures
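
The quantities named in the captions for Figures 2 and 5 can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the toy three-dimensional vectors, and the example ratings are all hypothetical, and the paper's actual embedding model is not assumed here. Because the Figure 2 caption uses both "similarity" and "cosine distance", the sketch shows both, taking cosine distance to mean one minus cosine similarity, which is one common convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """One common convention (assumed here): distance = 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


def positive_fraction(ratings):
    """Share of annotators who rated an action as beneficial (0.0 to 1.0)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return sum(1 for r in ratings if r) / len(ratings)


# Toy embeddings (hypothetical values, not taken from the paper).
document = [1.0, 0.0, 0.0]
tailored_action = [0.8, 0.6, 0.0]   # points mostly in the document's direction
general_action = [0.0, 1.0, 0.0]    # orthogonal to the document

# Toy ratings from 10 hypothetical annotators: True means "beneficial".
ratings = [True] * 7 + [False] * 3

print("tailored similarity:", cosine_similarity(document, tailored_action))
print("general similarity:", cosine_similarity(document, general_action))
print("general distance:", cosine_distance(document, general_action))
print("positive fraction:", positive_fraction(ratings))
```

Under this convention, an action tailored to the document has a higher similarity (and a lower distance) than a generic one, and an action rated beneficial by every annotator has a positive fraction of 1.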