RADAR: Robust AI-Text Detection via Adversarial Learning
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
TL;DR
RADAR jointly trains an AI-text detector and a tunable paraphraser in an adversarial loop so that the detector withstands paraphrasing attacks. Training alternates between updating the paraphraser (via clipped PPO with an entropy term) to evade detection and updating the detector (via a reweighted logistic loss) to catch the paraphrased text. Evaluated across 8 LLMs and 4 datasets, RADAR detects AI-text robustly and transfers well, including to unseen paraphrasers such as GPT-3.5-Turbo. Under paraphrasing attacks it improves AUROC by up to ~32% over baseline detectors, and it remains resilient across paraphrase variants and in GPT-4 transfer settings, which suggests potential for universal robust AI-text detection. Limitations include occasional degradation on native text and the need for careful, validated deployment, given detection uncertainty and ethical considerations.
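To make the two training objectives named above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the default values of `clip_eps`, `entropy_coef`, and `lam`, the sign convention of the entropy term, and the choice to have the detector output the probability that a text is AI-generated are all assumptions made for illustration.

```python
import math

def clipped_ppo_objective(logp_new, logp_old, advantages, entropies,
                          clip_eps=0.2, entropy_coef=0.01):
    """Clipped PPO surrogate plus an entropy term, averaged over tokens.

    Higher is better for the paraphraser. The entropy term is added as a
    bonus here (standard PPO convention); the paper's exact sign and
    coefficient are not given in this summary, so treat both as assumptions.
    """
    total = 0.0
    for ln, lo, adv, ent in zip(logp_new, logp_old, advantages, entropies):
        ratio = math.exp(ln - lo)  # pi_new(token) / pi_old(token)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic choice between the unclipped and clipped surrogate terms.
        total += min(ratio * adv, clipped * adv) + entropy_coef * ent
    return total / len(logp_new)

def reweighted_logistic_loss(p_human, p_ai, p_para, lam=0.5, eps=1e-12):
    """Logistic loss over three groups of texts, with the AI groups reweighted.

    Each argument is a list of detector outputs, assumed here to be the
    predicted probability that a text is AI-generated:
      p_human: outputs on human-written texts (target 0)
      p_ai:    outputs on original AI-texts (target 1)
      p_para:  outputs on paraphrased AI-texts (target 1)
    `lam` (hypothetical name) down-weights the two AI groups so they do not
    outweigh the single human group.
    """
    def mean_nll(probs, target):
        losses = [-math.log(max(p if target == 1 else 1.0 - p, eps)) for p in probs]
        return sum(losses) / len(losses)

    return mean_nll(p_human, 0) + lam * mean_nll(p_ai, 1) + lam * mean_nll(p_para, 1)
```

In an alternating scheme of this kind, the advantages fed to the paraphraser would be derived from the detector's output on the paraphrased text, and the detector would then be refit on the paraphraser's latest outputs.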
Abstract
Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.
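Detector comparisons of this kind are reported in AUROC (see the TL;DR). As a reminder of what that metric measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen AI-text receives a higher detector score than a randomly chosen human-text. The function name and the toy scores are hypothetical and are not results from the paper.

```python
def auroc(scores_ai, scores_human):
    """AUROC from its pairwise (Mann-Whitney) definition.

    scores_ai / scores_human are detector scores where a higher value means
    "more likely AI-generated". Ties count as half a win.
    """
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(scores_ai) * len(scores_human))

# Toy scores for illustration only (not data from the paper).
perfect = auroc([0.9, 0.8], [0.1, 0.2])   # every AI-text outranks every human-text
partial = auroc([0.9, 0.4], [0.5, 0.1])   # one AI-text is ranked below a human-text
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. A paraphrasing attack succeeds against a detector to the extent that it pushes the scores of AI-texts down toward those of human-texts, which lowers this number.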
