RADAR: Robust AI-Text Detection via Adversarial Learning

Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho

TL;DR

RADAR adversarially trains an AI-text detector against a tunable paraphraser so that the detector withstands paraphrasing attacks. Training iterates between optimizing the paraphraser (via clipped PPO with an entropy regularization term) and the detector (via a reweighted logistic loss). RADAR achieves robust detection across 8 LLMs and 4 datasets, with strong transferability, including against unseen paraphrasers such as GPT-3.5-Turbo. The method delivers substantial improvements under paraphrasing attacks (up to ~32% AUROC gains over baselines) and demonstrates resilience across paraphrase variants and GPT-4 transfers, suggesting potential for universal robust AI-text detection. Limitations include occasional degradation on native text and the need for careful, validated deployment due to detection uncertainties and ethical considerations.
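
As a rough illustration of the two objectives named in the TL;DR, the sketch below writes out a generic clipped PPO surrogate with an entropy term and a generic reweighted logistic loss. It is not the paper's exact formulation: the function names, the defaults `clip_eps`, `entropy_coef`, and `lam`, the sign convention of the entropy term, and the way the loss terms are combined are all assumptions made for illustration.

```python
import math


def clipped_ppo_entropy_objective(logp_new, logp_old, advantages, entropies,
                                  clip_eps=0.2, entropy_coef=0.01):
    """Mean clipped-PPO surrogate plus an entropy term (a quantity to maximize).

    All arguments are per-sample lists. clip_eps and entropy_coef are
    illustrative defaults, not values taken from the paper.
    """
    total = 0.0
    for lp_new, lp_old, adv, ent in zip(logp_new, logp_old, advantages, entropies):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (lower) bound of the unclipped and clipped surrogates.
        total += min(ratio * adv, clipped * adv) + entropy_coef * ent
    return total / len(advantages)


def reweighted_logistic_loss(p_ai_human, p_ai_machine, p_ai_paraphrased, lam=0.5):
    """Detector loss: human-text term plus lam-weighted AI-text terms.

    Each argument is a list of detector outputs P(AI | text). The weight lam
    and the three-term split are assumptions for this sketch.
    """
    eps = 1e-12  # guards against log(0)

    def mean_nll(probs):
        return -sum(math.log(max(p, eps)) for p in probs) / len(probs)

    loss_human = mean_nll([1.0 - p for p in p_ai_human])  # should score low
    loss_machine = mean_nll(p_ai_machine)                 # should score high
    loss_paraphrased = mean_nll(p_ai_paraphrased)         # should score high
    return loss_human + lam * (loss_machine + loss_paraphrased)
```

In an adversarial setup of this kind, the paraphraser's advantage signal would be derived from the detector's output on paraphrased text, while the detector's loss would include the paraphraser's outputs as additional AI-text examples; how RADAR computes those quantities is specified in the paper, not here.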

Abstract

Recent advances in large language models (LLMs) and the intensifying popularity of ChatGPT-like applications have blurred the boundary of high-quality text generation between humans and machines. However, in addition to the anticipated revolutionary changes to our technology and society, the difficulty of distinguishing LLM-generated texts (AI-text) from human-generated texts poses new challenges of misuse and fairness, such as fake content generation, plagiarism, and false accusations of innocent writers. While existing works show that current AI-text detectors are not robust to LLM-based paraphrasing, this paper aims to bridge this gap by proposing a new framework called RADAR, which jointly trains a robust AI-text detector via adversarial learning. RADAR is based on adversarial training of a paraphraser and a detector. The paraphraser's goal is to generate realistic content to evade AI-text detection. RADAR uses the feedback from the detector to update the paraphraser, and vice versa. Evaluated with 8 different LLMs (Pythia, Dolly 2.0, Palmyra, Camel, GPT-J, Dolly 1.0, LLaMA, and Vicuna) across 4 datasets, experimental results show that RADAR significantly outperforms existing AI-text detection methods, especially when paraphrasing is in place. We also identify the strong transferability of RADAR from instruction-tuned LLMs to other LLMs, and evaluate the improved capability of RADAR via GPT-3.5-Turbo.

Paper Structure (27 sections, 9 equations, 9 figures, 8 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of RADAR. An AI-text corpus is first generated by a target (frozen) language model from a human-text corpus. RADAR introduces a paraphraser (a tunable language model) and a detector (a separate tunable language model). During training, the detector aims to distinguish human-text from AI-text, while the paraphraser aims to rewrite AI-text to evade detection. The parameters of the paraphraser and the detector are updated adversarially, as described in the paper's methods section. During evaluation, the trained detector is deployed to predict the likelihood of AI-generated content for any input instance.
  • Figure 3: RADAR's detection transferability between pairs of the 8 LLMs listed in the paper's LLM table. In the matrix, each row is the source LLM (model A) used to train the detector, and each column is the target LLM (model B) used for evaluation. Each entry reports the detection transferability from A to B; a larger value indicates better transferability. The bar chart shows the row-wise sum of the matrix, indicating the holistic transferability of each source LLM.
  • Figure 4: Detection AUROC of RADAR against multiple rounds of paraphrasing. The experiments are conducted on XSum using the detector trained for Camel-5B.
  • Figure 5: Evaluation of RADAR's paraphraser versus its initial version (T5-large).
  • Figure A1: Visualization of the training process of RADAR targeting Camel-5B.
  • ...and 4 more figures
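
The captions above rely on two quantities: AUROC, and the row-wise sum of a source-by-target transferability matrix. The sketch below shows one way to compute both. The function names and any numbers passed to them are hypothetical, and the definition of an individual transferability entry is left to the paper; the second function only aggregates values that are already given.

```python
def auroc(scores_human, scores_ai):
    """AUROC as the probability that a randomly chosen AI-text receives a
    higher detector score than a randomly chosen human-text (ties count 0.5).

    Uses an O(n * m) pairwise comparison, which is fine for small samples.
    """
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(scores_ai) * len(scores_human))


def holistic_transferability(matrix, source_names):
    """Row-wise sum of a source-by-target matrix, keyed by source model name.

    matrix[i][j] is assumed to already hold the transferability value from
    source model i to target model j.
    """
    return {name: sum(row) for name, row in zip(source_names, matrix)}
```

For example, with placeholder scores, `auroc([0.1, 0.6], [0.5, 0.9])` evaluates to 0.75, since three of the four (AI, human) score pairs are ranked correctly.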