Almost AI, Almost Human: The Challenge of Detecting AI-Polished Writing

Shoumik Saha, Soheil Feizi

TL;DR

This work investigates how AI-assisted polishing of human-written text confounds AI-detection systems. By introducing the 14.7K-sample APT-Eval benchmark with degree-based and percentage-based AI involvement across six domains, the authors systematically evaluate 12 detectors and optimize per-detector thresholds. The results reveal high false positive rates for minimally polished text, limited ability to distinguish degrees of AI involvement, and biases against older or smaller polishers, with domain-specific vulnerabilities. The study advocates probabilistic or tiered labeling, training on AI-polished data, and human oversight, and provides open access to the dataset to advance fairer, more robust AI-detection methods.

Abstract

The growing use of large language models (LLMs) for text generation has led to widespread concerns about AI-generated content detection. However, an overlooked challenge is AI-polished text, where human-written content undergoes subtle refinements using AI tools. This raises a critical question: should minimally polished text be classified as AI-generated? Such classification can lead to false plagiarism accusations and misleading claims about AI prevalence in online content. In this study, we systematically evaluate twelve state-of-the-art AI-text detectors using our AI-Polished-Text Evaluation (APT-Eval) dataset, which contains 14.7K samples refined at varying AI-involvement levels. Our findings reveal that detectors frequently flag even minimally polished text as AI-generated, struggle to differentiate between degrees of AI involvement, and exhibit biases against older and smaller models. These limitations highlight the urgent need for more nuanced detection methodologies.
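The per-detector threshold optimization and the false-positive concern described above can be illustrated with a minimal sketch. It assumes each detector outputs a score in [0, 1], that the threshold is chosen to maximize F1 on labeled human-written and AI-generated samples (the selection criterion is an assumption, since it is not stated here), and that minimally polished human text should count as human. The helper names and all scores are hypothetical, made up for illustration; they are not results from the study.

```python
from typing import Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the 'AI' class (label 1) when scores >= threshold are flagged as AI."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the observed score that maximizes F1 (ties go to the higher threshold)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(scores, labels, t), t))


def flag_rate(scores: Sequence[float], threshold: float) -> float:
    """Fraction of samples flagged as AI at the given threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Hypothetical detector scores (illustrative only).
human_scores = [0.05, 0.10, 0.20, 0.35]          # unmodified human-written text
ai_scores = [0.60, 0.80, 0.90, 0.95]             # fully AI-generated text
minimally_polished = [0.30, 0.55, 0.65, 0.70]    # human text after light AI polishing

scores = human_scores + ai_scores
labels = [0] * len(human_scores) + [1] * len(ai_scores)

threshold = best_threshold(scores, labels)
print("threshold:", threshold)
print("flag rate on human text:", flag_rate(human_scores, threshold))
print("flag rate on minimally polished text:", flag_rate(minimally_polished, threshold))
```

In this toy setup the tuned threshold separates the two clean classes perfectly, yet half of the lightly polished samples are still flagged. If those samples are treated as human-written, that flag rate is a false positive rate, which is the kind of failure the abstract describes.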

Paper Structure

This paper contains 26 sections, 18 figures, and 6 tables.

Figures (18)

  • Figure 1: Random sample from our APT-Eval dataset. Original human-written text (HWT) on the left; polished version on the right.
  • Figure 2: Distribution of Semantic Similarity for Degree-based AI-Polished Texts by GPT-4o.
  • Figure 3: AI-text detection rate for degree-based AI-polished-texts (APT) by all detectors.
  • Figure 4: AI-text prediction score with 95% confidence interval for percentage-based AI-polished-texts by Llama3-8B.
  • Figure 5: AI-text detection rate for degree-based AI-polished-texts from different polisher LLMs.
  • ...and 13 more figures