GradEscape: A Gradient-Based Evader Against AI-Generated Text Detectors

Wenlong Meng, Shuguo Fan, Chengkun Wei, Min Chen, Yuwei Li, Yuanchao Zhang, Zhikun Zhang, Wenzhi Chen

TL;DR

GradEscape is a gradient-based evader for AI-generated text (AIGT) detectors. It works around the discreteness of text by weighting detector embeddings with token probabilities, and it optimizes a seq2seq paraphraser under strict syntactic and semantic constraints. A warm-started evader enables cross-architecture attacks, while an opaque-model strategy combines tokenizer inference with model extraction to operate under query-only access. Empirical results across multiple datasets, detectors, and real-world services show GradEscape outperforming existing evaders with a compact 139M-parameter model and low-cost inference. These results motivate a detector-independent defense based on active paraphrasing. The work highlights vulnerabilities in current AIGT detectors and contributes open-source tools for robust detector development, alongside ethical safeguards and defense considerations.

Abstract

In this paper, we introduce GradEscape, the first gradient-based evader designed to attack AI-generated text (AIGT) detectors. GradEscape overcomes the non-differentiable computation problem, caused by the discrete nature of text, by introducing a novel approach to construct weighted embeddings for the detector input. It then updates the evader model parameters using feedback from victim detectors, achieving high attack success with minimal text modification. To address the tokenizer mismatch between the evader and the detector, we introduce a warm-started evader method, enabling GradEscape to adapt to detectors built on any language model architecture. Moreover, we employ novel tokenizer inference and model extraction techniques, which make evasion effective even under query-only access. We evaluate GradEscape on four datasets and three widely used language models, benchmarking it against four state-of-the-art AIGT evaders. Experimental results demonstrate that GradEscape outperforms existing evaders in various scenarios, including evaders that rely on an 11B-parameter paraphrase model, while using only 139M parameters. We have successfully applied GradEscape to two real-world commercial AIGT detectors. Our analysis reveals that the primary vulnerability stems from the disparity in text expression styles within the training data. We also propose a potential defense strategy to mitigate the threat of AIGT evaders. We open-source GradEscape to support the development of more robust AIGT detectors.
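The weighted-embedding idea mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the evader emits, for each output position, a probability distribution over the detector's vocabulary, and that the detector's embedding table is available. The names `softmax`, `weighted_embeddings`, `EMBEDDING_TABLE`, and the toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Turn one position's logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_embeddings(token_probs, embedding_table):
    """Return the probability-weighted average embedding for each position.

    token_probs: seq_len rows, each a distribution over the vocabulary.
    embedding_table: vocab_size rows, each a d-dimensional embedding.
    """
    dim = len(embedding_table[0])
    return [
        [sum(p * row[j] for p, row in zip(probs, embedding_table)) for j in range(dim)]
        for probs in token_probs
    ]


# Hypothetical toy setup: 3-token vocabulary, 2-dimensional embeddings.
EMBEDDING_TABLE = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
evader_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 3.0]]

# Soft inputs: every entry is a smooth function of the evader's logits,
# unlike an argmax token choice followed by a table lookup.
probs = [softmax(row) for row in evader_logits]
soft_inputs = weighted_embeddings(probs, EMBEDDING_TABLE)

# A one-hot distribution recovers the ordinary discrete embedding lookup.
one_hot = [[0.0, 0.0, 1.0]]
hard_inputs = weighted_embeddings(one_hot, EMBEDDING_TABLE)
```

Because each soft input is a weighted sum rather than a hard token selection, a detector loss computed on it can in principle be backpropagated to the evader's parameters; an automatic-differentiation framework would take the place of these plain lists in practice.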

Paper Structure

This paper contains 51 sections, 11 equations, 22 figures, and 17 tables.

Figures (22)

  • Figure 1: Attack scenarios.
  • Figure 2: $\mathsf{GradEscape}$ training procedure.
  • Figure 3: Evasion rates versus text quality metrics on GROVER News dataset.
  • Figure 4: Evasion rates versus text quality metrics on HC3 dataset.
  • Figure 5: Opaque model attack results.
  • ...and 17 more figures