Adversarial Attacks on AI-Generated Text Detection Models: A Token Probability-Based Approach Using Embeddings

Ahmed K. Kadhim, Lei Jiao, Rishad Shafik, Ole-Christoffer Granmo

TL;DR

This work investigates adversarial attacks on AI-generated text detectors by reengineering token probabilities through embedding- and synonym-based substitutions. It introduces a three-pronged framework (embedding similarity, synonym similarity, and a hybrid scheme) implemented with a transparent Tsetlin Machine to produce interpretable adversarial steps. Empirical results show substantial reductions in AUROC for multiple detectors (notably Fast-DetectGPT) on XSum and SQuAD, with Word2Vec and TM-AE delivering particularly strong attacks and hybrid methods achieving the lowest detection scores. The study highlights vulnerabilities in current detection systems and argues for robust defenses that fuse embedding-derived probability signals with traditional detection features to resist adversarial manipulation in practical settings.

Abstract

In recent years, text generation tools utilizing Artificial Intelligence (AI) have occasionally been misused across various domains, such as generating student reports or creative writing. This issue prompts plagiarism detection services to enhance their capabilities in identifying AI-generated content. Adversarial attacks are often used to test the robustness of AI-generated text detectors. This work proposes a novel textual adversarial attack on detection models such as Fast-DetectGPT. The method employs embedding models for data perturbation, aiming to reconstruct AI-generated texts so that their true origin is less likely to be detected. Specifically, we employ different embedding techniques, including the Tsetlin Machine (TM), an interpretable machine learning approach. By combining synonyms and embedding similarity vectors, we demonstrate a state-of-the-art reduction in detection scores against Fast-DetectGPT. In particular, on the XSum dataset, the detection score decreased from 0.4431 to 0.2744 AUROC, and on the SQuAD dataset, it dropped from 0.5068 to 0.3532 AUROC.
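The embedding-based perturbation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the function names `ranked_neighbors` and `perturb`, and the `max_fraction` parameter are all assumptions made for illustration, and the selection rule used here (always take the nearest neighbor by cosine similarity) is a simplification of whatever scoring the paper applies.

```python
import math

# Toy 3-dimensional vectors invented for illustration only; a real attack
# would load a trained embedding such as Word2Vec.
TOY_EMBEDDINGS = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.85, 0.15, 0.05],
    "huge": [0.8, 0.3, 0.1],
    "small": [-0.9, 0.1, 0.0],
    "city": [0.0, 0.9, 0.2],
    "town": [0.05, 0.85, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranked_neighbors(word, embeddings):
    """Return (candidate, similarity) pairs, most similar first."""
    if word not in embeddings:
        return []
    target = embeddings[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def perturb(text, embeddings, max_fraction=0.5):
    """Replace up to max_fraction of the words with their nearest neighbor.

    Words without an embedding are left unchanged (hypothetical policy).
    """
    words = text.split()
    budget = int(len(words) * max_fraction)
    replaced = 0
    output = []
    for word in words:
        neighbors = ranked_neighbors(word.lower(), embeddings)
        if replaced < budget and neighbors:
            output.append(neighbors[0][0])
            replaced += 1
        else:
            output.append(word)
    return " ".join(output)


print(perturb("a big city", TOY_EMBEDDINGS, max_fraction=1.0))
```

The replacement budget plays the role of a disturbance percentage: raising `max_fraction` swaps more words, which trades fidelity to the original text for a larger shift in the token statistics a detector sees.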

Paper Structure

This paper contains 20 sections, 2 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Proposed Adversarial Attack Framework. This figure illustrates the proposed design of an adversarial attack, where the input text (e.g., Doc1, Doc2, ...) is perturbed by selecting alternative tokens with low probability scores generated by embedding models, with the goal of misleading the detection model.
  • Figure 2: Clause formation in TM. The $y$-axis is the state index while the $x$-axis is the literal index. When the state of an automaton is above $N$, its corresponding literal is included in the clause. Before training, the states of the automata are initialized to $N$ (the yellow dots in the left figure). During training, the states are updated (moving up and down, as shown in the middle figure) based on the learning mechanism and the training samples. Once trained, the clause is expressed by ANDing the included literals (the green dots in the right figure) and ignoring the excluded literals (the red dots).
  • Figure 3: Heatmap illustrating the AUROC scores across various AI-text detection methods and embedding models. The x-axis represents the detection methods, while the y-axis corresponds to embedding models.
  • Figure 4: Impact of disturbance percentage and perturbation threshold on detection accuracy. The left panel shows the effect of increasing the percentage of replaced words, while the right panel illustrates the influence of synonym proximity (min, mid, high).
  • Figure 5: The impact of hybrid substitutions on detection accuracy across five PLMs, comparing all detection methods (left) and the Fast-DetectGPT method (right). Average detection scores for all methods fell to values between 0.1 and 0.2. For XSum, detection scores ranged from 0.1282 to 0.1467, while scores for SQuAD ranged from 0.1438 to 0.1814 (see blue and red numbers above dataset bars). The Fast-DetectGPT results (right) demonstrated average detection scores ranging between 0.2 and 0.5. Specifically, detection scores for XSum varied from 0.2744 to 0.3984, while scores for SQuAD ranged from 0.3532 to 0.5030.
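The clause-formation rule in the Figure 2 caption (a literal is included when its automaton state is above $N$, and the clause is the AND of the included literals) can be sketched as follows. The function names and the example states are hypothetical, and how a clause with no included literals is treated is a convention this sketch does not take from the paper; here it simply evaluates to true, as Python's `all` does on an empty sequence.

```python
def included_literals(states, n):
    """Indices of literals whose automaton state is strictly above n."""
    return [index for index, state in enumerate(states) if state > n]


def evaluate_clause(states, n, literal_values):
    """AND together the values of the included literals.

    literal_values[i] is the truth value of literal i for one input sample.
    Excluded literals are ignored entirely.
    """
    return all(literal_values[i] for i in included_literals(states, n))


# Hypothetical trained states for four literals, with N = 3.
states = [5, 2, 4, 1]
print(included_literals(states, 3))              # literals 0 and 2 are included
print(evaluate_clause(states, 3, [1, 0, 1, 0]))  # both included literals are true
print(evaluate_clause(states, 3, [1, 1, 0, 1]))  # literal 2 is false
```

Because the trained clause is just a conjunction over a readable set of literals, it can be inspected directly, which is the sense in which the TM is described as interpretable.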