Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Yize Cheng; Vinu Sankar Sadasivan; Mehrdad Saberi; Shoumik Saha; Soheil Feizi

TL;DR

Large language models enable fluent AI-generated text that can evade detectors, raising plagiarism and social-engineering risks. The authors propose Adversarial Paraphrasing, a training-free framework in which an off-the-shelf instruction-following LLM paraphrases AI-generated text under the guidance of an AI text detector. The method behaves like a detector-guided beam search at depth one, choosing each next token to minimize the detector's AI-score, and it transfers strongly across neural network-based, watermark-based, and zero-shot detectors. Experiments show substantial reductions in detection rates across eight detectors and multiple datasets: the true positive rate at a 1% false positive rate (T@1%F) falls by around 80–88% on average, with only minor declines in text quality. The work highlights vulnerabilities in current detectors and motivates more robust defenses or adversarially informed dataset generation for detector training.
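The detector-guided, depth-one search described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the authors' implementation: `propose` plays the role of the paraphrasing LLM's candidate next tokens, `ai_score` plays the role of the guiding detector, and the word lists are toy data chosen only to make the selection rule visible.

```python
from typing import Callable, List, Sequence


def adversarial_decode(
    propose: Callable[[List[str]], Sequence[str]],
    ai_score: Callable[[List[str]], float],
    max_tokens: int = 20,
    eos: str = "<eos>",
) -> List[str]:
    """Greedy detector-guided decoding (a sketch, not the paper's code).

    At each step, every candidate token is appended to the current prefix,
    the resulting text is scored by the detector, and the candidate with
    the lowest AI-score is kept.
    """
    out: List[str] = []
    for _ in range(max_tokens):
        candidates = propose(out)
        if not candidates:
            break
        best = min(candidates, key=lambda tok: ai_score(out + [tok]))
        if best == eos:
            break
        out.append(best)
    return out


# Toy stand-in for the paraphraser: fixed candidate tokens per position.
SLOTS = [["Moreover", "Also"], ["utilize", "use"], ["tools", "instruments"]]


def toy_propose(prefix: List[str]) -> Sequence[str]:
    return SLOTS[len(prefix)] if len(prefix) < len(SLOTS) else ["<eos>"]


# Toy stand-in for the detector: fraction of tokens on a "flagged" list.
FLAGGED = {"Moreover", "utilize", "instruments"}


def toy_ai_score(tokens: List[str]) -> float:
    return sum(t in FLAGGED for t in tokens) / len(tokens)


result = adversarial_decode(toy_propose, toy_ai_score)
print(result)  # ['Also', 'use', 'tools']
```

In a realistic setting, `propose` would presumably return the paraphraser's most likely next tokens and `ai_score` would call a trained detector on the decoded text, so each generated token costs one detector evaluation per candidate.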

Abstract

The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, a simple paraphrasing attack, ironically, increases the true positive rate at a 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT. In contrast, adversarial paraphrasing guided by OpenAI-RoBERTa-Large reduces T@1%F by 64.49% on RADAR and a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors, including neural network-based, watermark-based, and zero-shot approaches, our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success and find that our method can significantly reduce detection rates, in most cases with only a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in light of increasingly sophisticated evasion techniques.
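The T@1%F metric quoted above can be made concrete with a small sketch. It assumes a detector that outputs higher scores for more AI-like text and flags a text when its score strictly exceeds a threshold; the function name `tpr_at_fpr` and all scores are hypothetical illustrations, not data or code from the paper.

```python
import math
from typing import Sequence


def tpr_at_fpr(
    human_scores: Sequence[float],
    ai_scores: Sequence[float],
    target_fpr: float = 0.01,
) -> float:
    """True positive rate at a fixed false positive rate.

    The threshold is set on human-written texts so that at most
    `target_fpr` of them are flagged; the returned value is the
    fraction of AI-generated texts flagged at that threshold.
    """
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we may misflag (epsilon guards float rounding).
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    return sum(s > threshold for s in ai_scores) / len(ai_scores)


# Made-up detector scores for 100 human-written texts: 0.000 ... 0.099.
human = [i / 1000 for i in range(100)]
# Made-up scores for five AI texts before and after a paraphrasing attack.
before_attack = [0.9, 0.8, 0.5, 0.2, 0.05]
after_attack = [0.3, 0.05, 0.02, 0.01, 0.09]

print(tpr_at_fpr(human, before_attack))  # 0.8
print(tpr_at_fpr(human, after_attack))   # 0.2
```

Fixing the false positive rate at 1% keeps the comparison between detectors fair: an attack counts as successful only when it lowers detection without the detector being allowed to flag more human-written text.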
