Writing as a testbed for open ended agents
Sian Gooding, Lucia Lopez-Rivilla, Edward Grefenstette
TL;DR
The paper treats open-ended writing as a rigorous testbed for autonomous LLM-based agents, examining how action diversity, human alignment, and iterative refinement jointly determine document quality. By benchmarking three prominent LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o) across 22 documents and a curated set of 5,750 actions per model, the study shows that a broad action space alone does not guarantee improvement: models excel at adding content but underperform on evaluative, subtractive, and goal-aligned refinements, sometimes causing semantic drift over multiple revision steps. A mixed-methods evaluation that pairs embedding-based diversity metrics with human judgments reveals model-specific biases and the limits of self-evaluation via prompting, underscoring the need for justification, context grounding, and robust filtering. The findings point to practical directions for building open-ended writing agents: expanding action strategies beyond additive edits, integrating rationale generation, and designing interfaces and training procedures that align autonomous revisions with writer intent and document goals. Overall, the work advances a framework for benchmarking open-ended agents in writing and highlights challenges and potential solutions relevant to open-ended AI systems beyond text editing.
Abstract
Open-ended tasks are particularly challenging for LLMs because the solution space is vast, demanding both expansive exploration and adaptable strategies, especially when success lacks a clear, objective definition. Writing, which combines such a solution space with subjective evaluation criteria, provides a compelling testbed for studying these problems. In this paper, we investigate the potential of LLMs to act as collaborative co-writers, capable of suggesting and implementing text improvements autonomously. We analyse three prominent LLMs (Gemini 1.5 Pro, Claude 3.5 Sonnet, and GPT-4o), focusing on how their action diversity, human alignment, and iterative improvement capabilities impact overall performance. This work establishes a framework for benchmarking autonomous writing agents and, more broadly, highlights fundamental challenges and potential solutions for building systems capable of excelling in diverse open-ended domains.
