StealthRL: Reinforcement Learning Paraphrase Attacks for Multi-Detector Evasion of AI-Text Detectors

Suraj Ranganath; Atharv Ramesh

TL;DR

StealthRL is a reinforcement-learning framework for stress-testing AI-text detectors against adaptive paraphrase attacks at a realistic operating point (1% FPR). A paraphrase policy is trained with Group Relative Policy Optimization (GRPO) and LoRA on Qwen3-4B-Instruct against a two-detector ensemble (RoBERTa and Fast-DetectGPT), then evaluated on three detector families, including Binoculars, which is held out of training. The results reveal severe robustness gaps and vulnerabilities that transfer across detector architectures, highlighting how readily current detectors are fooled by changes to surface-level cues. The evaluation includes detector-score distribution analyses, LLM-based quality judgments, and AUROC and TPR metrics with bootstrap confidence intervals. The work provides a principled adversarial evaluation protocol and public code to support robustness research, defense development, and safer deployment of AI-text detection systems.

Abstract

AI-text detectors face a critical robustness challenge: adversarial paraphrasing attacks that preserve semantics while evading detection. We introduce StealthRL, a reinforcement learning framework that stress-tests detector robustness under realistic adversarial conditions. StealthRL trains a paraphrase policy against a multi-detector ensemble using Group Relative Policy Optimization (GRPO) with LoRA adapters on Qwen3-4B, optimizing a composite reward that balances detector evasion with semantic preservation. We evaluate six attack settings (M0-M5) against three detector families (RoBERTa, Fast-DetectGPT, and Binoculars) at the security-relevant 1% false positive rate operating point. StealthRL achieves near-zero detection (0.001 mean TPR@1%FPR), reduces mean AUROC from 0.74 to 0.27, and attains a 99.9% attack success rate. Critically, attacks transfer to a held-out detector family not seen during training, revealing shared architectural vulnerabilities rather than detector-specific brittleness. We additionally conduct LLM-based quality evaluation via Likert scoring, analyze detector score distributions to explain why evasion succeeds, and provide per-detector AUROC with bootstrap confidence intervals. Our results expose significant robustness gaps in current AI-text detection and establish StealthRL as a principled adversarial evaluation protocol. Code and evaluation pipeline are publicly available at https://github.com/suraj-ranganath/StealthRL.
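
The headline metric, TPR@1%FPR, can be illustrated with a minimal sketch. It assumes that higher detector scores mean "more likely AI" and that the threshold is calibrated on human-written texts only. The function names and toy scores are hypothetical and are not taken from the StealthRL codebase.

```python
def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Pick a threshold so that at most target_fpr of human texts score
    strictly above it (assumption: higher score = more likely AI)."""
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts that may be flagged without exceeding the target FPR.
    allowed = int(target_fpr * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")
    # Only scores strictly above the (allowed + 1)-th highest human score are flagged.
    return ranked[allowed]


def tpr_at_fpr(human_scores, ai_scores, target_fpr=0.01):
    """Fraction of AI texts flagged at the threshold calibrated on human texts."""
    threshold = threshold_at_fpr(human_scores, target_fpr)
    flagged = sum(1 for score in ai_scores if score > threshold)
    return flagged / len(ai_scores)


# Toy, made-up scores: 100 human texts scoring 0..99.
human = list(range(100))
original_ai = [99, 120, 150, 50]     # mostly above the calibrated threshold
paraphrased_ai = [10, 20, 30, 40]    # pushed into the human score range

tpr_before = tpr_at_fpr(human, original_ai)
tpr_after = tpr_at_fpr(human, paraphrased_ai)
```

Fixing the threshold on human texts first is what makes the operating point "security-relevant": an attack only needs to push AI scores below that single calibrated threshold, which is why a paraphraser can drive TPR toward zero even when AUROC stays well above zero.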

Paper Structure (38 sections, 3 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Work
AI-Text Detection Methods
Adversarial Attacks on Detectors
Reinforcement Learning for Text Generation
Method
Threat Model
Reward Design
Detector evasion reward.
Semantic similarity reward.
KL penalty.
Training Pipeline
Inference
Experimental Setup
Dataset
...and 23 more sections
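
The three reward components listed under Reward Design (detector evasion, semantic similarity, KL penalty) could be combined roughly as follows. This is a sketch under stated assumptions, not the paper's equations, which are not reproduced on this page: the weights `w_evasion`, `w_semantic`, and `beta_kl`, the use of a mean over detector probabilities, and both function names are hypothetical. The group-relative normalization is the standard GRPO idea of scoring each sample against the other samples drawn for the same prompt.

```python
import statistics


def composite_reward(detector_probs, semantic_sim, kl_divergence,
                     w_evasion=1.0, w_semantic=1.0, beta_kl=0.1):
    """Hypothetical composite reward: reward evasion and meaning preservation,
    penalize drift from the reference policy. All weights are assumptions."""
    # Mean probability of "AI-written" across the detector ensemble.
    mean_ai_prob = sum(detector_probs) / len(detector_probs)
    evasion = 1.0 - mean_ai_prob
    return w_evasion * evasion + w_semantic * semantic_sim - beta_kl * kl_divergence


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy group of three paraphrases of the same input (made-up numbers).
rewards = [
    composite_reward([0.9, 0.8], semantic_sim=0.95, kl_divergence=0.1),
    composite_reward([0.2, 0.4], semantic_sim=0.90, kl_divergence=0.5),
    composite_reward([0.1, 0.1], semantic_sim=0.40, kl_divergence=2.0),
]
advantages = group_relative_advantages(rewards)
```

In this toy group the second paraphrase earns the largest advantage because it evades both detectors while keeping high similarity. The third evades more strongly but loses meaning and drifts further from the reference policy, which is the trade-off a composite reward is meant to expose.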

Figures (7)

Figure 1: StealthRL training and evaluation pipeline. A paraphrase policy (Qwen3-4B with LoRA) is trained via GRPO against a two-detector ensemble (RoBERTa + Fast-DetectGPT) with semantic similarity reward. The trained policy is then evaluated against all three detector families, including the held-out Binoculars, at the 1% FPR operating point.
Figure 2: Detection evasion results for methods M0-M5. (a) AUROC by detector. (b) Mean AUROC with confidence intervals. (c) TPR at 1% FPR. (d) Mean attack success rate. StealthRL (M2, teal) achieves below-random AUROC on Fast-DetectGPT and Binoculars and near-zero TPR across all detectors.
Figure 3: TPR@1%FPR heatmap across detectors and methods. Darker colors indicate higher detection rates. StealthRL (M2) and Homoglyph (M5) achieve near-zero TPR across all three detector families, including the held-out Binoculars.
Figure 4: Detector score distributions for AI samples across methods (one panel per detector). StealthRL (M2) and Homoglyph (M5) push scores below the detection threshold, explaining their near-zero TPR@1%FPR.
Figure 5: Per-detector AUROC with 95% bootstrap confidence intervals. StealthRL (M2, teal) achieves below-random AUROC on Fast-DetectGPT (0.071) and Binoculars (0.041), with substantial reduction on RoBERTa (0.693). The dashed line marks the 0.5 random-chance baseline.
...and 2 more figures
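
The per-detector AUROC values with bootstrap confidence intervals reported in Figure 5 can be approximated with a percentile bootstrap, sketched below. This assumes human and AI samples are resampled independently with replacement. The resampling scheme, `n_boot`, and the function names are assumptions, and the paper's exact procedure may differ.

```python
import random


def auroc(human_scores, ai_scores):
    """Probability that a random AI text outscores a random human text
    (ties count as half), which equals the area under the ROC curve."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


def bootstrap_auroc_ci(human_scores, ai_scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus a percentile bootstrap interval for AUROC."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample each class independently, with replacement.
        h = rng.choices(human_scores, k=len(human_scores))
        a = rng.choices(ai_scores, k=len(ai_scores))
        estimates.append(auroc(h, a))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return auroc(human_scores, ai_scores), lower, upper


# Toy, made-up scores where attacked AI texts score *below* human texts.
human = [0.6, 0.7, 0.8, 0.9]
attacked_ai = [0.1, 0.2, 0.3, 0.75]
point, lower, upper = bootstrap_auroc_ci(human, attacked_ai, n_boot=200)
```

An AUROC below 0.5, as reported for Fast-DetectGPT and Binoculars, means the detector ranks attacked AI text as more human-like than genuine human text. As a consistency check, the three per-detector values in Figure 5 average to (0.071 + 0.041 + 0.693) / 3 ≈ 0.27, matching the mean AUROC quoted in the abstract.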
