StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Suraj Ranganath, Atharv Ramesh

TL;DR

StealthRL is a reinforcement-learning framework for stress-testing AI-text detectors against adaptive paraphrase attacks at a realistic operating point (1% FPR). A paraphrase policy is trained with Group Relative Policy Optimization (GRPO) and LoRA on Qwen3-4B-Instruct against a two-detector ensemble (RoBERTa and Fast-DetectGPT), then evaluated on three detector families, including the held-out Binoculars. The results reveal severe robustness gaps and cross-architecture vulnerabilities. The evaluation covers detector score distributions, LLM-based quality judgments, and AUROC and TPR metrics with bootstrap confidence intervals, and it highlights the fragility of current detectors to surface-level cues. The work provides a principled adversarial evaluation protocol and public code to accelerate robustness research, with implications for defense development and safer deployment of AI-text detection systems.

Abstract

AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, Fast-DetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
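
To make the training signal described in the abstract more concrete, here is a minimal sketch of a composite reward and of GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the additive form of the reward, and the `sim_weight` parameter are all hypothetical, and the paper's actual reward terms and weights are not given in this summary. The only part taken from GRPO as generally described is that rewards are standardized within the group of samples drawn for the same prompt.

```python
from statistics import mean, pstdev


def composite_reward(detector_scores, similarity, sim_weight=1.0):
    """Hypothetical reward: evade the detector ensemble while preserving meaning.

    detector_scores: per-detector probability-of-AI values in [0, 1].
    similarity: semantic similarity to the source text in [0, 1].
    sim_weight: illustrative trade-off weight (an assumption, not from the paper).
    """
    evasion = 1.0 - mean(detector_scores)  # high when the ensemble is fooled
    return evasion + sim_weight * similarity


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled paraphrases.

    Uses the population standard deviation; implementations differ on this
    detail, so treat it as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical paraphrases of the same source text.
    group = [
        composite_reward([0.9, 0.8], 0.95),  # faithful but still detected
        composite_reward([0.1, 0.2], 0.90),  # evades and stays faithful
        composite_reward([0.3, 0.4], 0.75),
        composite_reward([0.6, 0.7], 0.85),
    ]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

Because advantages are computed relative to the group mean, a paraphrase is reinforced only when it beats its siblings for the same prompt, which removes the need for a separate learned value function.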

Paper Structure (38 sections, 3 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: StealthRL training and evaluation pipeline. A paraphrase policy (Qwen3-4B with LoRA) is trained via GRPO against a two-detector ensemble (RoBERTa + Fast-DetectGPT) with semantic similarity reward. The trained policy is then evaluated against all three detector families, including the held-out Binoculars, at the 1% FPR operating point.
  • Figure 2: Detection evasion results for methods M0-M5. (a) AUROC by detector. (b) Mean AUROC with confidence intervals. (c) TPR at 1% FPR. (d) Mean attack success rate. StealthRL (M2, teal) achieves below-random AUROC on Fast-DetectGPT and Binoculars and near-zero TPR across all detectors.
  • Figure 3: TPR@1%FPR heatmap across detectors and methods. Darker colors indicate higher detection rates. StealthRL (M2) and Homoglyph (M5) achieve near-zero TPR across all three detector families, including the held-out Binoculars.
  • Figure 4: Detector score distributions for AI samples across methods (one panel per detector). StealthRL (M2) and Homoglyph (M5) push scores below the detection threshold, explaining their near-zero TPR@1%FPR.
  • Figure 5: Per-detector AUROC with 95% bootstrap confidence intervals. StealthRL (M2, teal) achieves below-random AUROC on Fast-DetectGPT (0.071) and Binoculars (0.041), with substantial reduction on RoBERTa (0.693). The dashed line marks the 0.5 random-chance baseline.
  • ...and 2 more figures
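
The metrics named in the captions above (TPR at 1% FPR, AUROC, and bootstrap confidence intervals) can be sketched in a few lines. This is a standard-library illustration under assumptions, not the paper's evaluation pipeline: the function names are hypothetical, higher scores are assumed to mean "more likely AI", the threshold is calibrated on human-written text only, and the percentile bootstrap is one common choice that the paper may or may not use.

```python
import random


def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Threshold such that at most target_fpr of human scores lie strictly above it."""
    ranked = sorted(human_scores, reverse=True)
    k = int(target_fpr * len(ranked))  # number of false positives allowed
    return ranked[k] if k < len(ranked) else float("-inf")


def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Fraction of AI samples flagged at the threshold calibrated on human text."""
    thr = threshold_at_fpr(human_scores, target_fpr)
    return sum(s > thr for s in ai_scores) / len(ai_scores)


def auroc(human_scores, ai_scores):
    """Probability that a random AI sample outscores a random human one (ties count 0.5)."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            wins += 1.0 if a > h else 0.5 if a == h else 0.0
    return wins / (len(ai_scores) * len(human_scores))


def bootstrap_auroc_ci(human_scores, ai_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample each class with replacement."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        h = rng.choices(human_scores, k=len(human_scores))
        a = rng.choices(ai_scores, k=len(ai_scores))
        stats.append(auroc(h, a))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Under this definition, an AUROC below 0.5 means the detector ranks attacked AI text as more human-like than genuine human text on average, which is how the below-random values in the Figure 5 caption should be read. The pairwise AUROC loop is quadratic in the sample count, so a rank-based formulation would be preferable for large evaluation sets.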