Beemo: Benchmark of Expert-edited Machine-generated Outputs

Ekaterina Artemova; Jason Lucas; Saranya Venkatraman; Jooyoung Lee; Sergei Tilga; Adaku Uchendu; Vladislav Mikhailov

TL;DR

Beemo addresses a gap in machine-generated text (MGT) detection benchmarks by introducing a multi-author dataset of expert-edited and LLM-edited outputs across five use cases. It combines outputs from ten instruction-tuned LLMs with edits by experts and two additional LLM editors, yielding 19.6k texts (6.5k human-written, machine-generated, and expert-edited texts, plus 13.1k machine-generated and LLM-edited texts) for out-of-domain evaluation of 33 MGT detector configurations. The study shows that expert edits evade MGT detection, whereas LLM-edited texts are unlikely to be recognized as human-written, and it highlights differences in how zero-shot and pretrained detectors generalize. The work emphasizes the need for continuous data updates, human baselines, and responsible release practices to advance practical MGT detection research.

Abstract

The rapid proliferation of large language models (LLMs) has increased the volume of machine-generated texts (MGTs) and blurred text authorship in various domains. However, most existing MGT benchmarks include single-author texts (human-written and machine-generated). This conventional design fails to capture more practical multi-author scenarios, where the user refines the LLM response for natural flow, coherence, and factual correctness. Our paper introduces the Benchmark of Expert-edited Machine-generated Outputs (Beemo), which includes 6.5k texts written by humans, generated by ten instruction-finetuned LLMs, and edited by experts for various use cases, ranging from creative writing to summarization. Beemo additionally comprises 13.1k machine-generated and LLM-edited texts, allowing for diverse MGT detection evaluation across various edit types. We document Beemo's creation protocol and present the results of benchmarking 33 configurations of MGT detectors in different experimental setups. We find that expert-based editing evades MGT detection, while LLM-edited texts are unlikely to be recognized as human-written. Beemo and all materials are publicly available.
