AuthorMist: Evading AI Text Detectors with Reinforcement Learning

Isaac David, Arthur Gervais

TL;DR

AuthorMist presents a reinforcement learning framework that treats detector evasion as an optimization problem, using external AI detectors as reward signals in an API-as-reward loop. Built on a 3B-parameter transformer (Qwen2.5-3B Instruct), it leverages Group Relative Policy Optimization (GRPO) with KL regularization to learn paraphrasing policies that preserve meaning while significantly reducing detectability. Across diverse detectors and datasets, AuthorMist achieves high attack success rates and strong semantic fidelity, illustrating notable weaknesses in current AI text detectors and highlighting ethical considerations in detector design and deployment. The work suggests that robust detector development may require focusing on content quality and attribution rather than solely on identifying AI authorship, while signaling important dual-use and governance implications for high-stakes writing contexts.

Abstract

In the age of powerful AI-generated text, automatic detectors have emerged to identify machine-written content. This poses a threat to author privacy and freedom, as text authored with AI assistance may be unfairly flagged. We propose AuthorMist, a novel reinforcement learning-based system to transform AI-generated text into human-like writing. AuthorMist leverages a 3-billion-parameter language model as a backbone, fine-tuned with Group Relative Policy Optimization (GRPO) to paraphrase text in a way that evades AI detectors. Our framework establishes a generic approach where external detector APIs (GPTZero, WinstonAI, Originality.ai, etc.) serve as reward functions within the reinforcement learning loop, enabling the model to systematically learn outputs that these detectors are less likely to classify as AI-generated. This API-as-reward methodology can be applied broadly to optimize text against any detector with an accessible interface. Experiments on multiple datasets and detectors demonstrate that AuthorMist effectively reduces the detectability of AI-generated text while preserving the original meaning. Our evaluation shows attack success rates ranging from 78.6% to 96.2% against individual detectors, significantly outperforming baseline paraphrasing methods. AuthorMist maintains high semantic similarity (above 0.94) with the original text while successfully evading detection. These results highlight limitations in current AI text detection technologies and raise questions about the sustainability of the detection-evasion arms race.
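To make the API-as-reward idea concrete, here is a minimal sketch of how detector scores for a group of candidate paraphrases might be turned into group-relative advantages, the quantity GRPO-style training uses in place of a learned value baseline. This is not the authors' implementation. The reward shape (one minus the detector's AI probability), the function names, and the `toy_detector` stand-in are all assumptions made for illustration; a real setup would call an external detector API and would also include the KL regularization term mentioned in the TL;DR.

```python
from statistics import mean, pstdev
from typing import Callable, List


def detector_reward(ai_probability: float) -> float:
    """Assumed reward shape: higher when the detector sees the text as less AI-like."""
    return 1.0 - ai_probability


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(candidates: List[str], detector: Callable[[str], float]) -> List[float]:
    """Score a group of paraphrases of one input and return their advantages."""
    rewards = [detector_reward(detector(text)) for text in candidates]
    return group_relative_advantages(rewards)


def toy_detector(text: str) -> float:
    """Hypothetical stand-in for a detector API; returns a fake AI probability in [0, 1]."""
    return min(1.0, len(text) / 100.0)


if __name__ == "__main__":
    group = ["short", "a much longer candidate text"]
    print(score_group(group, toy_detector))
```

Candidates that the detector scores as less AI-like than their group's average receive positive advantages, and those scored as more AI-like receive negative ones, so no separate critic model is needed to provide a baseline.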

Paper Structure

This paper contains 51 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: AuthorMist system architecture. The system takes AI-generated text as input and processes it through an RL-optimized paraphrasing model trained to minimize detector scores, producing human-like text that preserves the original meaning while evading detection.
  • Figure 2: ROC curves comparing AuthorMist models against six AI text detectors. Models trained against Originality.ai and Winston.ai show strong cross-detector evasion performance, with curves near or below the diagonal line (random chance). The Hello SimpleAI detector discriminates poorly against our models, while GPTZero and Originality.ai demonstrate greater resilience.
  • Figure 3: AUROC Matrix for AuthorMist GRPO-Trained Bypasser Models, showing the AUROC scores for each model-detector combination.
  • Figure 4: Text Similarity for Qwen2.5-3B GRPO-Trained Bypasser Models. The similarity score ranges from 0 to 1, with 1 indicating identical text.
  • Figure 5: Perplexity Distribution by Detector and Text Type. The violin plots show the distribution of perplexity scores across human-written (green), original AI-generated (orange), and paraphrased (blue) texts for each detector-specific model.