Navigating the Shadows: Unveiling Effective Disturbances for Modern AI Content Detectors

Ying Zhou, Ben He, Le Sun

TL;DR

This work simulates real-world scenarios in both informal and professional writing to explore the out-of-the-box performance of current detectors, and constructs 12 black-box text perturbation methods to assess the robustness of current detection models across various perturbation granularities.

Abstract

With the launch of ChatGPT, large language models (LLMs) have attracted global attention. LLMs are now used extensively for article writing, raising concerns about intellectual property protection, personal privacy, and academic integrity. In response, AI-text detection has emerged to distinguish between human-written and machine-generated content. However, recent research indicates that these detection systems often lack robustness and struggle to correctly classify perturbed texts. There is currently no systematic evaluation of detection performance in real-world applications, nor a comprehensive examination of perturbation techniques and detector robustness. To bridge this gap, our work simulates real-world scenarios in both informal and professional writing, exploring the out-of-the-box performance of current detectors. Additionally, we have constructed 12 black-box text perturbation methods to assess the robustness of current detection models across various perturbation granularities. Furthermore, through adversarial learning experiments, we investigate the impact of perturbation data augmentation on the robustness of AI-text detectors. We have released our code and data at https://github.com/zhouying20/ai-text-detector-evaluation.

Paper Structure (32 sections, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Performance of state-of-the-art AI-text detectors significantly decreases after introducing perturbation attacks. The green dashed threshold line represents the adversarially trained RoBERTa classifier detector, achieving a detection accuracy of 0.912 on the mixed test data of the original and perturbed text.
  • Figure 2: Gradual reduction in average ASR with an increase in the number of perturbed data augmentations. Meanwhile, the F1 score on unperturbed data remains relatively stable, around 0.98. Refer to the experiments appendix for details.