A Practical Examination of AI-Generated Text Detectors for Large Language Models

Brian Tufts, Xuandong Zhao, Lei Li

TL;DR

This paper interrogates the practical reliability of AI-generated text detectors under black-box conditions across diverse tasks, languages, and generation models. It benchmarks seven detectors (three trained, four zero-shot) using unseen data and red-teaming prompts, revealing substantial variability and vulnerability to adversarial strategies. It argues that AUROC alone is insufficient for real-world detection, recommends the true positive rate at a fixed false positive rate (TPR@FPR) as a more informative metric, and shows that simple rewriting or adversarial prompts can degrade detectability. The findings highlight significant limitations of current detectors and the need for robust, continual evaluation frameworks as new models emerge.

Abstract

The proliferation of large language models has raised growing concerns about their misuse, particularly in cases where AI-generated text is falsely attributed to human authors. Machine-generated content detectors claim to effectively identify such text under various conditions and from any language model. This paper critically evaluates these claims by assessing several popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, PHD, LogRank, Binoculars) on a range of domains, datasets, and models that these detectors have not previously encountered. We employ various prompting strategies to simulate practical adversarial attacks, demonstrating that even moderate efforts can significantly evade detection. We emphasize the importance of the true positive rate at a specific false positive rate (TPR@FPR) metric and demonstrate that these detectors perform poorly in certain settings, with TPR@0.01 as low as 0%. Our findings suggest that both trained and zero-shot detectors struggle to maintain high sensitivity while keeping the false positive rate reasonably low.
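To make the distinction between AUROC and TPR@FPR concrete, the following is a minimal sketch, not the paper's evaluation code. The function names `auroc` and `tpr_at_fpr`, the convention that a higher score means "more likely machine-generated", and the toy scores are all assumptions made for illustration.

```python
import math


def auroc(human_scores, machine_scores):
    """Probability that a random machine text outscores a random human text.

    Assumes higher score = more likely machine-generated. Ties count as 0.5.
    """
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(human_scores) * len(machine_scores))


def tpr_at_fpr(human_scores, machine_scores, target_fpr=0.01):
    """Share of machine texts flagged at the loosest threshold whose
    false positive rate on human texts does not exceed target_fpr."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts that may be wrongly flagged.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    # Flag only scores strictly above this human score, so at most
    # `allowed` human texts are flagged even when scores tie.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(1 for s in machine_scores if s > threshold)
    return flagged / len(machine_scores)


# Toy scores (made up): two human texts receive very high scores,
# which forces a strict threshold at a 1% false positive rate.
toy_human = [0.1] * 98 + [0.9] * 2
toy_machine = [0.5] * 100

toy_auroc = auroc(toy_human, toy_machine)
toy_tpr = tpr_at_fpr(toy_human, toy_machine, target_fpr=0.01)
print(f"AUROC = {toy_auroc:.2f}, TPR@0.01 = {toy_tpr:.2f}")
```

In this toy case the detector ranks almost every machine text above almost every human text, so AUROC is high, yet no machine text is caught once the threshold is strict enough to misflag at most 1% of human texts. This is the kind of gap that motivates reporting TPR at a low FPR alongside AUROC.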

Paper Structure

This paper contains 29 sections, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Pipeline for prompting and evaluation. Adversarial prompting and rewriting are applied to the LLMs. After collecting machine-generated text, AUROC and TPR@FPR are measured for each detector.
  • Figure 2: Comparison of average AUROC for multilingual tasks across all detectors under normal prompting, and average TPR@0.01 across all detectors under normal, template, and rewrite prompting. Error bars show maximum and minimum performance across detectors.
  • Figure 3: Comparison of average AUROC for English tasks across all detectors under normal prompting, and average TPR@0.01 across all detectors under normal, template, and rewrite prompting. Error bars show maximum and minimum performance across detectors.
  • Figure 4: Correlations between the TPR at various FPR values and the overall AUROC score. The AUROC score is more representative of the middle range of FPR values, while this detection task is more concerned with the low end of the FPR range.
  • Figure 5: AUROC and TPR@0.01 for each zero-shot method using various underlying models. Only Fast-DetectGPT and Binoculars show a significant change in performance with GPT2-Medium.