RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors

Liam Dugan, Alyssa Hwang, Filip Trhlik, Josh Magnus Ludan, Andrew Zhu, Hainiu Xu, Daphne Ippolito, Chris Callison-Burch

TL;DR

RAID introduces the Robust AI Detection benchmark, the largest shared dataset for evaluating machine-generated text detectors across 11 generator models, 8 domains, 11 adversarial attacks, and 4 decoding strategies, totaling over 6.2 million generations. It benchmarks 12 detectors (8 open-source, 4 closed-source) and reveals widespread robustness gaps, including high default FPRs for open-source detectors, substantial accuracy drops with simple generation variations, and detector vulnerability to domain-, model-, and attack-specific factors. The study advocates principled evaluation practices, such as per-domain FPR calibration and reporting across decoding strategies, and provides a leaderboard and RAID-extra to foster ongoing, community-driven improvement. By releasing RAID and RAID-extra, the authors aim to accelerate robust detector research and help society manage the harms associated with machine-generated text through better tools and shared benchmarks.

Abstract

Many commercial and open-source models claim to detect machine-generated text with extremely high accuracy (99% or more). However, very few of these detectors are evaluated on shared benchmark datasets, and even when they are, the datasets used for evaluation are insufficiently challenging: they lack variations in sampling strategy, adversarial attacks, and open-source generative models. In this work we present RAID: the largest and most challenging benchmark dataset for machine-generated text detection. RAID includes over 6 million generations spanning 11 models, 8 domains, 11 adversarial attacks, and 4 decoding strategies. Using RAID, we evaluate the out-of-domain and adversarial robustness of 8 open- and 4 closed-source detectors and find that current detectors are easily fooled by adversarial attacks, variations in sampling strategies, repetition penalties, and unseen generative models. We release our data along with a leaderboard to encourage future research.

Paper Structure (89 sections, 3 equations, 12 figures, 18 tables)

Figures (12)

  • Figure 1: Detectors for machine-generated text are often highly performant on default model settings but fail to detect more unusual settings such as using random sampling with a repetition penalty.
  • Figure 2: An overview of the structure of the RAID dataset. We generate 2,000 continuations for every combination of domain, model, decoding, penalty, and adversarial attack. This results in roughly 6.2 million generations for testing. We then evaluate each detector on all pieces of generated text in the dataset.
  • Figure 3: Histogram of examples (y-axis) grouped by their repetitiveness measured via SelfBLEU score (x-axis). We see that both random sampling and repetition penalty greatly reduce repetitiveness for all models.
  • Figure 4: Detection accuracy (y-axis) vs. False Positive Rate (x-axis) for all detectors. We see that Binoculars works significantly better than other detectors at low FPR and that few detectors can operate at FPR<1%.
  • Figure 5: Accuracy at FPR=5% (y-axis) vs. decoding strategy across our three detector classes on non-adversarial text. We see that repetition penalty greatly reduces accuracy in both greedy decoding and sampling.
  • ...and 7 more figures
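
Figures 4 and 5 report accuracy at a fixed false positive rate, and the TL;DR recommends per-domain FPR calibration. The following is a minimal sketch of how such a calibration could be done, not the authors' implementation. It assumes each detector returns a numeric score where higher means "more likely machine-generated"; the function names and the `(domain, score)` input layout are hypothetical.

```python
import math
from collections import defaultdict


def calibrate_thresholds(human_scores, target_fpr=0.05):
    """Pick one threshold per domain from scores on human-written text.

    human_scores: iterable of (domain, score) pairs (hypothetical layout).
    A text is flagged when its score is strictly above the threshold, so
    at most floor(target_fpr * n) human texts per domain are flagged.
    """
    by_domain = defaultdict(list)
    for domain, score in human_scores:
        by_domain[domain].append(score)

    thresholds = {}
    for domain, scores in by_domain.items():
        scores.sort()
        n = len(scores)
        # Number of false positives we are allowed in this domain.
        k = min(math.floor(target_fpr * n), n - 1)
        # Everything strictly above the (n - k)-th smallest score is flagged.
        thresholds[domain] = scores[n - k - 1]
    return thresholds


def accuracy_on_generated(generated_scores, thresholds):
    """Fraction of machine-generated texts flagged at the calibrated thresholds."""
    flagged = 0
    total = 0
    for domain, score in generated_scores:
        total += 1
        if score > thresholds[domain]:
            flagged += 1
    return flagged / total if total else 0.0


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    human = [("news", 0.1), ("news", 0.2), ("news", 0.3), ("news", 0.9),
             ("poetry", 0.4), ("poetry", 0.5), ("poetry", 0.6), ("poetry", 0.7)]
    generated = [("news", 0.5), ("news", 0.25), ("poetry", 0.65), ("poetry", 0.9)]
    th = calibrate_thresholds(human, target_fpr=0.25)
    print(th, accuracy_on_generated(generated, th))
```

Calibrating separately per domain keeps a detector that scores, say, poetry systematically higher than news from exceeding the target FPR on one domain while staying under it on another.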