Adversarial Paraphrasing: A Universal Attack for Humanizing AI-Generated Text

Yize Cheng, Vinu Sankar Sadasivan, Mehrdad Saberi, Shoumik Saha, Soheil Feizi

TL;DR

Large language models enable fluent AI-generated text that can evade detectors, raising plagiarism and social-engineering risks. The authors propose Adversarial Paraphrasing, a training-free framework in which an instruction-following LLM rephrases AI text under the guidance of an AI text detector. The method behaves like a detector-guided beam search at depth one, selecting at each step the candidate next token that minimizes the AI-score, and it transfers strongly across neural, watermark-based, and zero-shot detectors. Experiments show substantial reductions in detection rates across eight detectors and multiple datasets, with average T@1%F reductions of roughly 80–88% and only minor declines in text quality. The work highlights vulnerabilities in current detectors and motivates more robust defenses or adversarially informed dataset generation for detector training.

Abstract

The increasing capabilities of Large Language Models (LLMs) have raised concerns about their misuse in AI-generated plagiarism and social engineering. While various AI-generated text detectors have been proposed to mitigate these risks, many remain vulnerable to simple evasion techniques such as paraphrasing. However, recent detectors have shown greater robustness against such basic attacks. In this work, we introduce Adversarial Paraphrasing, a training-free attack framework that universally humanizes any AI-generated text to evade detection more effectively. Our approach leverages an off-the-shelf instruction-following LLM to paraphrase AI-generated content under the guidance of an AI text detector, producing adversarial examples that are specifically optimized to bypass detection. Extensive experiments show that our attack is both broadly effective and highly transferable across several detection systems. For instance, a simple paraphrasing attack ironically increases the true positive rate at 1% false positive rate (T@1%F) by 8.57% on RADAR and 15.03% on Fast-DetectGPT, whereas adversarial paraphrasing guided by OpenAI-RoBERTa-Large reduces T@1%F by 64.49% on RADAR and by a striking 98.96% on Fast-DetectGPT. Across a diverse set of detectors (including neural network-based, watermark-based, and zero-shot approaches), our attack achieves an average T@1%F reduction of 87.88% under the guidance of OpenAI-RoBERTa-Large. We also analyze the tradeoff between text quality and attack success and find that our method can significantly reduce detection rates, in most cases with only a slight degradation in text quality. Our adversarial setup highlights the need for more robust and resilient detection strategies in light of increasingly sophisticated evasion techniques.
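
The T@1%F metric quoted above can be illustrated with a small sketch. This is not the authors' evaluation code: the function names `tpr_at_fpr` and `relative_drop`, the strict "score > threshold" flagging rule, and the toy scores are all assumptions made for illustration. The sketch also assumes that the reported percentage reductions are relative drops, which is how the Figure 4 caption describes them.

```python
def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Fraction of AI texts flagged at a threshold that misflags at most
    target_fpr of the human texts. Higher score = more likely AI (assumed)."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to misflag (epsilon guards float error).
    allowed_fp = int(target_fpr * len(ranked) + 1e-9)
    # Flag a text only when its score is strictly above this threshold.
    threshold = ranked[allowed_fp] if allowed_fp < len(ranked) else float("-inf")
    flagged = sum(1 for s in ai_scores if s > threshold)
    return flagged / len(ai_scores)


def relative_drop(before, after):
    """Relative reduction in a metric; negative if the metric went up."""
    return (before - after) / before


# Toy data (hypothetical): 100 human scores 0.00 .. 0.99, four AI texts.
human = [i / 100 for i in range(100)]
ai_original = [0.995, 0.99, 0.985, 0.5]
ai_attacked = [0.995, 0.6, 0.3, 0.5]

t_before = tpr_at_fpr(human, ai_original)  # 3 of 4 flagged
t_after = tpr_at_fpr(human, ai_attacked)   # 1 of 4 flagged
print(t_before, t_after, relative_drop(t_before, t_after))
```

Under this reading, an attack that raises T@1%F (as simple paraphrasing does on RADAR and Fast-DetectGPT) yields a negative relative drop.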

Paper Structure

This paper contains 19 sections, 12 figures, 18 tables, 1 algorithm.

Figures (12)

  • Figure 1: An overview of our universal and training-free framework for humanizing AI text. At every auto-regressive step of adversarial paraphrasing, guided by an AI text detector, we search for the token with the lowest "AI-score" among the top candidate tokens sampled by the paraphraser LLM. Token generation continues until the paraphrasing is finished (i.e., the [EOS] token is sampled).
  • Figure 2: The system prompt used to configure our paraphraser LLM.
  • Figure 3: ROC curves illustrating the AI text detection performance on several deployed detectors, including neural network-based, watermark-based, and zero-shot detectors. The false positive rate (FPR) axis is displayed in log-scale to highlight fine-grained distinctions in the low-FPR region. It can be observed that adversarial paraphrasing consistently and significantly reduces the detection performance across all deployed detectors when compared to the baselines.
  • Figure 4: Relative drop in T@1%F across all combinations of guidance and deployed detectors. The first row corresponds to the simple (non-adversarial) paraphrasing baseline (Krishna et al., 2023). On average, simple paraphrasing leads to a 30.27% relative drop in T@1%F. In comparison, adversarial paraphrasing achieves significantly higher reductions: 84.94% with MAGE as guidance, 86.89% with RADAR, 80.75% with OpenAI-RoBERTa-Base, and 87.88% with OpenAI-RoBERTa-Large. These results highlight both the universal effectiveness and transferability of our attack.
  • Figure 5: GPT-4o automated text quality evaluations comparing simple and adversarial paraphrases. The top row shows Likert-scale ratings for overall quality and semantic similarity to the original text. Though a slight trade-off in text quality can be seen, the error bars show that the difference is not statistically significant. The bottom row presents head-to-head win rates, where in most cases simple paraphrases outperform adversarial paraphrases less than half of the time.
  • ...and 7 more figures
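
The per-token search described in the Figure 1 caption can be sketched with toy stand-ins. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_top_candidates` and `toy_ai_score` are hypothetical placeholders for the paraphraser LLM and the guidance detector, and scoring each candidate by running the detector on the text generated so far plus that candidate is an assumption about how the "AI-score" is obtained.

```python
EOS = "[EOS]"

# Hypothetical stand-in for the paraphraser LLM: a fixed list of top
# candidate tokens for each generation step.
TOY_CANDIDATES = [
    ["The", "Moreover"],
    ["cat", "feline"],
    ["sat", "resided"],
    [EOS],
]

# Hypothetical stand-in detector vocabulary: words it treats as "AI-sounding".
AI_SOUNDING = {"Moreover", "feline", "resided"}


def toy_top_candidates(tokens_so_far):
    """Return the candidate next tokens for the current step."""
    step = len(tokens_so_far)
    return TOY_CANDIDATES[step] if step < len(TOY_CANDIDATES) else [EOS]


def toy_ai_score(tokens):
    """Toy AI-score: fraction of tokens in the AI-sounding set."""
    if not tokens:
        return 0.0
    return sum(tok in AI_SOUNDING for tok in tokens) / len(tokens)


def adversarial_paraphrase(top_candidates, ai_score, max_steps=50):
    """Greedy detector-guided decoding: at each step keep the candidate
    token whose continuation has the lowest AI-score; stop at EOS."""
    tokens = []
    for _ in range(max_steps):
        candidates = top_candidates(tokens)
        best = min(candidates, key=lambda tok: ai_score(tokens + [tok]))
        if best == EOS:
            break
        tokens.append(best)
    return tokens


print(adversarial_paraphrase(toy_top_candidates, toy_ai_score))
```

Because only one token is kept per step, the loop corresponds to the "beam search at depth one" reading given in the TL;DR. In a real setting the candidate set would come from the paraphraser's sampled top tokens and the score from an actual detector.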