RoFT: A Tool for Evaluating Human Detection of Machine-Generated Text

Liam Dugan, Daphne Ippolito, Arun Kirubarajan, Chris Callison-Burch

TL;DR

RoFT introduces a gamified web platform for evaluating how well humans detect machine-generated text via a boundary-detection task across domains such as news and fiction. Each round begins with human-written sentences and continues with machine-generated ones; users must identify the exact boundary and explain their choice, and the true boundary is revealed after each attempt. A pilot study with about 1848 high-quality annotations shows modest exact-boundary accuracy (≈15.8%) and a detectable rightward shift in where judges locate the boundary. A points-based scoring scheme gives credit for partially correct detections and offers insight into annotator behavior. The work demonstrates RoFT’s potential for large-scale analysis of decoding strategies, prompts, and human factors. It also outlines plans to release a rich dataset of natural-language explanations and to expand gamification so that data collection can scale beyond paid crowdsourcing.

Abstract

In recent years, large neural networks for natural language generation (NLG) have made leaps and bounds in their ability to generate fluent text. However, the tasks of evaluating quality differences between NLG systems and understanding how humans perceive the generated text remain both crucial and difficult. In this system demonstration, we present Real or Fake Text (RoFT), a website that tackles both of these challenges by inviting users to try their hand at detecting machine-generated text in a variety of domains. We introduce a novel evaluation task based on detecting the boundary at which a text passage that starts off human-written transitions to being machine-generated. We show preliminary results of using RoFT to evaluate detection of machine-generated news articles.

Paper Structure

This paper contains 24 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: A word cloud of common words that annotators used to describe why they thought sentences were machine-generated.
  • Figure 2: The user interface for annotation.
  • Figure 3: A user's profile page.
  • Figure 4: A histogram of the filtered annotation set grouped by the distance (in number of sentences) between the sentence selected by the annotator and the true boundary sentence.
  • Figure 5: In (a) we show the average number of points (Section 4.3) received per annotation in the filtered annotation set, grouped by the temporal order in which the items were shown to annotators, from 0 (first) to 9 (last). In (b) we show the average number of points received per item in the filtered annotation set for each value of $p$ used for decoding. Error bars show standard deviation. No statistically significant trends were observed in this preliminary study.
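
The grouping behind Figure 4 and the exact-boundary accuracy quoted in the TL;DR can be sketched as below. This is a minimal illustration, not the authors' analysis code: the `(guess, boundary)` tuple layout, the function names, and the toy data are all assumptions made for the example. Each annotation is reduced to a signed offset in sentences, where 0 means the annotator picked the true boundary sentence and a positive value means the guess fell after it.

```python
from collections import Counter

def boundary_offsets(annotations):
    """Signed distance in sentences from the true boundary to the guess.

    `annotations` is a list of (guess_index, boundary_index) pairs.
    This layout is hypothetical; the source does not describe RoFT's data format.
    """
    return [guess - boundary for guess, boundary in annotations]

def offset_histogram(annotations):
    """Count annotations per offset, sorted by offset (the Figure 4 grouping)."""
    counts = Counter(boundary_offsets(annotations))
    return dict(sorted(counts.items()))

def exact_accuracy(annotations):
    """Fraction of annotations that land exactly on the boundary sentence."""
    offsets = boundary_offsets(annotations)
    return sum(1 for d in offsets if d == 0) / len(offsets)

def mean_offset(annotations):
    """Average signed offset; a positive value indicates a rightward shift."""
    offsets = boundary_offsets(annotations)
    return sum(offsets) / len(offsets)

# Invented toy annotations, for illustration only (not data from the study).
toy = [(5, 5), (7, 5), (4, 5), (6, 5), (5, 5), (9, 6)]

if __name__ == "__main__":
    print(offset_histogram(toy))
    print(round(exact_accuracy(toy), 3))
    print(round(mean_offset(toy), 3))
```

On real annotations, a histogram whose mass sits mostly at positive offsets would correspond to the rightward shift described in the TL;DR, with judges tending to place the boundary later than it actually occurs.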