DUPE: Detection Undermining via Prompt Engineering for Deepfake Text

James Weichert, Chinecherem Dimobi

TL;DR

This paper investigates the reliability of AI text detectors in educational settings by evaluating Kirchenbauer et al. watermarking, ZeroGPT, and GPTZero on a dataset of 212 human and 208 AI-generated essays. It demonstrates that paraphrasing AI-generated text with ChatGPT 3.5 can significantly degrade detector performance, yielding high attack success rates across detectors. Baseline results reveal nontrivial false positive/false negative rates, especially for watermarking and ZeroGPT, challenging claims of near-perfect accuracy. The work highlights ethical concerns about relying on detectors for academic integrity and urges cautious use coupled with human judgment and further detector refinement.

Abstract

As large language models (LLMs) become increasingly commonplace, so does concern about distinguishing human text from AI text. The growing power of these models is of particular concern to teachers, who may worry that students will use LLMs to write school assignments. Facing a technology with which they are unfamiliar, teachers may turn to publicly available AI text detectors. Yet the accuracy of many of these detectors has not been thoroughly verified, posing potential harm to students who are falsely accused of academic dishonesty. In this paper, we evaluate three different AI text detectors (Kirchenbauer et al. watermarks, ZeroGPT, and GPTZero) against human and AI-generated essays. We find that watermarking results in a high false positive rate, and that ZeroGPT has both high false positive and false negative rates. Further, we are able to significantly increase the false negative rate of all detectors by using ChatGPT 3.5 to paraphrase the original AI-generated texts, thereby effectively bypassing the detectors.
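
The false positive and false negative rates discussed in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions, where "positive" means a text is flagged as AI-generated; the helper names (`ConfusionCounts`, `tally`) and the toy labels are hypothetical, not the paper's code or data.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary detector where 'positive' means 'flagged as AI'."""
    true_pos: int   # AI texts flagged as AI
    false_pos: int  # human texts flagged as AI
    true_neg: int   # human texts passed as human
    false_neg: int  # AI texts passed as human


def tally(is_ai, flagged):
    """Build confusion counts from ground-truth labels and detector verdicts."""
    if len(is_ai) != len(flagged):
        raise ValueError("labels and verdicts must have the same length")
    counts = ConfusionCounts(0, 0, 0, 0)
    for truth, verdict in zip(is_ai, flagged):
        if truth and verdict:
            counts.true_pos += 1
        elif truth and not verdict:
            counts.false_neg += 1
        elif not truth and verdict:
            counts.false_pos += 1
        else:
            counts.true_neg += 1
    return counts


def false_positive_rate(c):
    """Share of human texts wrongly flagged as AI: FP / (FP + TN)."""
    humans = c.false_pos + c.true_neg
    return c.false_pos / humans if humans else 0.0


def false_negative_rate(c):
    """Share of AI texts that pass as human: FN / (FN + TP)."""
    ai_texts = c.false_neg + c.true_pos
    return c.false_neg / ai_texts if ai_texts else 0.0


# Toy data for illustration only (hypothetical, not results from the paper).
toy_is_ai = [True, True, True, True, False, False, False, False, False]
toy_flagged = [True, True, False, True, False, True, False, False, False]
toy_counts = tally(toy_is_ai, toy_flagged)
print(false_positive_rate(toy_counts), false_negative_rate(toy_counts))
```

In this framing, a false positive is a human-written essay flagged as AI (the case that risks a false accusation), while a paraphrasing attack succeeds by turning true positives into false negatives.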

Paper Structure (24 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: An excerpt from a watermarked GPT Neo text that includes multiple textual artifacts.
  • Figure 2: The scatter plot shows the strong ($r = -0.60$) negative relationship between a human text's perplexity and its ZeroGPT "AI %" rating.