Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations

Lekkala Sai Teja; Annepaka Yadagiri; Sangam Sai Anish; Siva Gopala Krishna Nuthakki; Partha Pakray

TL;DR

This work tackles robust detection of AI-generated text under adversarial paraphrasing. It introduces Perturbation-Invariant Feature Engineering (PIFE), which canonicalizes the input text and builds a discrepancy vector of features (e.g., semantic similarity, Levenshtein distance, n-gram metrics) that explicitly models perturbation artifacts, enabling a Transformer detector to resist semantic evasion. Empirical results show that conventional adversarial training fails against semantic attacks, while the PIFE-augmented detector sustains high performance across character-, word-, and sentence-level perturbations, outperforming zero-shot detectors on in-domain data and offering stronger robustness to paraphrase-based evasion. The findings highlight the value of explicit perturbation modeling for practical, robust AI-text detection, and point to hybrid approaches and broader generalization studies as promising future directions.
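The idea of a discrepancy vector can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the normalization steps in `canonicalize`, the choice of character trigrams, and all function names are hypothetical and are not taken from the paper. The semantic-similarity feature is left out because it would require an embedding model.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Hypothetical normalization: Unicode folding, zero-width removal,
    whitespace collapsing, lowercasing. Not the paper's actual pipeline."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)  # zero-width characters
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap between the character n-gram sets of two strings."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def discrepancy_vector(text: str) -> list[float]:
    """Features measuring how far a text sits from its canonical form."""
    canon = canonicalize(text)
    dist = levenshtein(text, canon)
    return [float(dist), dist / max(len(text), 1), ngram_jaccard(text, canon)]
```

In a setup like the one described, such a vector would be passed to the classifier alongside the text representation. A text that is already in canonical form yields zero edit distance and full n-gram overlap, whereas a perturbed text leaves a measurable trace.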

Abstract

The growth of highly advanced Large Language Models (LLMs) poses a significant dual-use problem, making it necessary to create dependable AI-generated text detection systems. Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique that foils statistical detection. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training and then by introducing a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering (PIFE). PIFE first transforms the input text into a standardized form using a multi-stage normalization pipeline. It then quantifies the magnitude of that transformation using metrics such as Levenshtein distance and semantic similarity, and feeds these signals directly to the classifier. We evaluate both a conventionally hardened Transformer and our PIFE-augmented model against a hierarchical taxonomy of character-, word-, and sentence-level attacks. Our findings first confirm that conventional adversarial training, while resilient to syntactic noise, fails against semantic attacks, an effect we term the "semantic evasion threshold", where its True Positive Rate (TPR) at a strict 1% False Positive Rate (FPR) plummets to 48.8%. In stark contrast, our PIFE model, which explicitly engineers features from the discrepancy between a text and its canonical form, overcomes this limitation. It maintains a remarkable 82.6% TPR under the same conditions, effectively neutralizing the most sophisticated semantic attacks. This superior performance demonstrates that explicitly modeling perturbation artifacts, rather than merely training on them, is a more promising path toward achieving genuine robustness in the adversarial arms race.
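The headline metric, True Positive Rate at a fixed False Positive Rate, can be computed as in the sketch below. This is an illustrative assumption about how such a metric is typically evaluated, not the authors' evaluation code. The function name `tpr_at_fpr` and the convention that label 1 means AI-generated and higher scores mean "more likely AI-generated" are hypothetical.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """TPR at the loosest threshold whose FPR does not exceed max_fpr.

    scores: detector outputs, higher = more likely AI-generated (assumed).
    labels: 1 for AI-generated, 0 for human-written (assumed).
    """
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative example")

    # Number of human-written texts we may misclassify without exceeding max_fpr.
    allowed = int(max_fpr * len(neg))

    # Predict "AI-generated" only when score > threshold. Picking the
    # (allowed + 1)-th highest negative score guarantees that at most
    # `allowed` negatives lie strictly above the threshold.
    threshold = neg[allowed] if allowed < len(neg) else float("-inf")

    return sum(s > threshold for s in pos) / len(pos)
```

Under this definition, a reported 48.8% TPR at 1% FPR would mean that, with the threshold set so that at most 1% of human-written texts are flagged, fewer than half of the AI-generated texts are caught.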
