Beemo: Benchmark of Expert-edited Machine-generated Outputs

Ekaterina Artemova, Jason Lucas, Saranya Venkatraman, Jooyoung Lee, Sergei Tilga, Adaku Uchendu, Vladislav Mikhailov

TL;DR

Beemo addresses a gap in MGT detection benchmarks by introducing a multi-author dataset that includes expert-edited and LLM-edited outputs across five use cases. It combines responses from ten instruction-tuned LLMs with edits by expert editors and two additional LLM editors, yielding 19.6k texts for out-of-domain evaluation and for analyzing the robustness of 33 detector configurations. The study shows that expert edits can evade MGT detection, while LLM-edited texts are unlikely to be recognized as human-written, and it highlights differences in generalization between zero-shot and pretrained detectors. The work emphasizes the need for continuous data updates, human baselines, and responsible release practices to advance practical MGT detection research.

Abstract

The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo's creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.

Paper Structure

This paper contains 49 sections, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Overview of Beemo's creation pipeline. (a) Use No Robots [no_robots] as the source of prompts and human-written responses across five categories. Generate responses from ten open-source instruction-finetuned LLMs. (b) Refine the LLMs' responses with a team of expert editors. (c) Refine the LLMs' responses using two state-of-the-art LLMs and editing prompts (P1-P3). Each of 2,187 instances includes nine text versions.
  • Figure 2: Distribution of edit percentages across five edit ranges for expert annotators, GPT-4o (P1, P2, P3), and Llama3.1-70B-Instruct (P1, P2, P3). The bars represent the number of instances falling within each edit percentage range for each editor type.
  • Figure 3: Comparison of average edit percentages among expert editors, GPT-4o, and Llama3.1-70B-Instruct. The bars represent the mean edit percentage for each editor type, with error bars indicating the standard deviation.
  • Figure 4: Results in the "Expert-edited" (label=0) vs. "machine-generated" (label=1) scenario divided into seven groups by the edit range.
  • Figure 5: Comparison of average edit percentages between GPT-4o and Llama3.1-70B-Instruct across three different prompts. The bars represent the mean edit percentage for each prompt, with error bars indicating the standard deviation.
  • ...and 1 more figure
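
The corpus sizes quoted in the abstract (6.5k, 13.1k, and 19.6k in the TL;DR) can be checked against the Figure 1 caption, which states that each of 2,187 instances includes nine text versions. The sketch below is only a sanity check of that arithmetic. The breakdown of the nine versions (one human-written, one machine-generated, one expert-edited, and six LLM-edited from two LLM editors times three prompts) is an assumption inferred from the caption and abstract, and the names `N_INSTANCES`, `VERSIONS_PER_INSTANCE`, and `corpus_sizes` are hypothetical, not part of any Beemo release.

```python
# Number of instances, as stated in the Figure 1 caption.
N_INSTANCES = 2187

# Assumed breakdown of the nine text versions per instance (inferred, not
# stated explicitly in the source).
VERSIONS_PER_INSTANCE = {
    "human_written": 1,
    "machine_generated": 1,
    "expert_edited": 1,
    "llm_edited": 2 * 3,  # two LLM editors x three editing prompts (P1-P3)
}


def corpus_sizes(n_instances, versions):
    """Return the number of texts per version type across all instances."""
    return {name: n_instances * count for name, count in versions.items()}


sizes = corpus_sizes(N_INSTANCES, VERSIONS_PER_INSTANCE)

# Human-written + machine-generated + expert-edited texts ("6.5k" in the abstract).
expert_subset = (
    sizes["human_written"] + sizes["machine_generated"] + sizes["expert_edited"]
)
# Machine-generated texts after LLM editing ("13.1k" in the abstract).
llm_subset = sizes["llm_edited"]
# All nine versions combined ("19.6k" in the TL;DR).
total = sum(sizes.values())

print(expert_subset)  # 6561
print(llm_subset)     # 13122
print(total)          # 19683
```

Under this assumed breakdown the three figures agree with the reported counts: 6,561 and 13,122 match 6.5k and 13.1k, and the total of 19,683 matches 19.6k when truncated rather than rounded.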