Learning to Rewrite: Generalized LLM-Generated Text Detection

Ran Li, Wei Hao, Weiliang Zhao, Junfeng Yang, Chengzhi Mao

TL;DR

Learning2Rewrite (L2R) reframes AI-generated text detection as a rewrite-distance problem: a rewrite model is trained to edit human-written content more than AI-generated content, which yields a stable, domain-agnostic decision boundary. By combining a differentiable proxy loss with a calibration mechanism, L2R generalizes across 21 domains and multiple LLMs, outperforming state-of-the-art detectors in in-distribution (ID), out-of-distribution (OOD), and adversarial settings. The approach is supported by a diversely sourced dataset with prompt variations, and it is robust to distributional shifts and attacks while offering interpretability through highlighted rewrites. The work suggests that reinforcing LLM rewriting tendencies is a scalable, practical route to reliable AI-generated text detection in real-world deployment.

Abstract

Large language models (LLMs) present significant risks when used to generate non-factual content and spread disinformation at scale. Detecting such LLM-generated content is crucial, yet current detectors often struggle to generalize in open-world contexts. We introduce Learning2Rewrite, a novel framework for detecting AI-generated text with exceptional generalization to unseen domains. Our method leverages the insight that LLMs inherently modify AI-generated content less than human-written text when tasked with rewriting. By training LLMs to minimize alterations on AI-generated inputs, we amplify this disparity, yielding a more distinguishable and generalizable edit distance across diverse text distributions. Extensive experiments on data from 21 independent domains and four major LLMs (GPT-3.5, GPT-4, Gemini, and Llama-3) demonstrate that our detector outperforms state-of-the-art detection methods by up to 23.04% in AUROC for in-distribution tests, 37.26% for out-of-distribution tests, and 48.66% under adversarial attacks. Our unique training objective ensures better generalizability than directly training for classification with the same number of parameters. Our findings suggest that reinforcing LLMs' inherent rewriting tendencies offers a robust and scalable solution for detecting AI-generated text.
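The detection rule described in the abstract can be illustrated with a minimal Python sketch: rewrite the input, measure how much it changed, and flag texts that changed little as likely AI-generated. Everything named here is an assumption made for illustration rather than the authors' implementation: the `rewrite` callable stands in for a (fine-tuned) rewrite model, the score is a character-level Levenshtein distance normalized by length, and the threshold value is hypothetical.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rewrite_score(text: str, rewrite: Callable[[str], str]) -> float:
    """Edit distance between a text and its rewrite, normalized to [0, 1].

    Normalizing by the longer length is an assumption of this sketch.
    """
    rewritten = rewrite(text)
    longest = max(len(text), len(rewritten), 1)
    return levenshtein(text, rewritten) / longest


def is_ai_generated(text: str, rewrite: Callable[[str], str],
                    threshold: float) -> bool:
    """Flag a text as AI-generated when the rewriter barely changes it."""
    return rewrite_score(text, rewrite) < threshold


# Hypothetical stand-in for a rewrite model: it only "fixes" informal wording.
def toy_rewrite(text: str) -> str:
    return text.replace("gonna", "going to")
```

With the toy rewriter, `is_ai_generated("The results are consistent.", toy_rewrite, threshold=0.05)` returns `True` because the text is left untouched, while a sentence containing "gonna" receives a positive score. In practice the threshold would be chosen on held-out data.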

Paper Structure

This paper contains 30 sections, 3 equations, 5 figures, 9 tables, 1 algorithm.

Figures (5)

  • Figure 1: Rewriting for LLM Text Detection. The histograms depict the edit distance distributions for human-written and AI-generated texts, illustrating how fine-tuning a rewrite model enhances their separation. We show two domains: purple and yellow represent the human and AI distributions for Product Review texts, while blue and orange represent those for Environmental texts. Without fine-tuning the rewrite model, the human and AI distributions cannot be separated by a single threshold (red line, above). After fine-tuning, the texts can be separated by this threshold (below). On the right, we conceptualize L2R's intuition by showing that the rugged decision boundary between human and AI texts, caused by varying data distributions across domains, can be better aligned and divided by a single threshold after fine-tuning. Specifically, the standard deviation of the decision thresholds across all domains decreases from 0.7506 to 0.4428 after fine-tuning.
  • Figure 2: Overview. Deleted characters are marked in red, added characters are marked in blue, and unmodified characters are in black. We exploit the difference in rewriting distance between human and AI texts for classification. While the off-the-shelf Llama-3 model gives different amounts of rewriting for human and AI texts (above), rewrites from our fine-tuned model are much more separable (below).
  • Figure 3: Relationship between the number of trainable parameters and ID and OOD AUROC scores for L2R and RAIDAR. As the number of parameters increases from $1\times10^6$ to $7\times10^6$, both L2R and RAIDAR show higher ID performance and lower OOD performance, indicating that overfitting emerges as the LLM's trainable parameters grow. L2R outperforms Llama Logits either OOD alone or both ID and OOD, showing the superior robustness and accuracy of L2R.
  • Figure 4: Training loss curves for the rewrite model. The orange line plots the loss when training without the calibration method, and the blue line plots the loss when training with it. The latter converges faster and is more stable than the former.
  • Figure 5: Examples of texts in our proposed dataset, along with the amount of edits the L2R model makes for human and LLM data. Deleted characters are marked in red, inserted characters are in blue, and unmodified characters are in black. The examples demonstrate the diverse domains and source LLMs available in the dataset, as well as L2R's ability to separate human and LLM texts via rewriting.
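
The Figure 1 caption reports the standard deviation of per-domain decision thresholds as a measure of how well a single threshold can serve all domains. The sketch below shows one way such a statistic could be computed. It is not taken from the paper: the threshold selection rule (the accuracy-maximizing midpoint between observed scores), the use of the population standard deviation, and the function names are all assumptions of this example.

```python
from statistics import pstdev
from typing import Dict, List, Tuple


def best_threshold(human_scores: List[float], ai_scores: List[float]) -> float:
    """Pick the threshold that maximizes accuracy for one domain.

    Scores are rewrite distances; a score below the threshold is
    classified as AI-generated, otherwise as human-written.
    """
    values = sorted(set(human_scores) | set(ai_scores))
    if len(values) == 1:
        return values[0]  # degenerate case: all scores identical
    # Candidate thresholds are midpoints between consecutive observed scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    total = len(human_scores) + len(ai_scores)

    def accuracy(t: float) -> float:
        correct_ai = sum(s < t for s in ai_scores)
        correct_human = sum(s >= t for s in human_scores)
        return (correct_ai + correct_human) / total

    return max(candidates, key=accuracy)  # first best candidate on ties


def threshold_spread(
    domains: Dict[str, Tuple[List[float], List[float]]]
) -> float:
    """Population standard deviation of the per-domain best thresholds."""
    thresholds = [best_threshold(h, a) for h, a in domains.values()]
    return pstdev(thresholds)


# Made-up scores for two hypothetical domains: (human scores, AI scores).
example = {
    "reviews": ([0.6, 0.8], [0.1, 0.3]),
    "environment": ([1.6, 1.8], [1.1, 1.3]),
}
spread = threshold_spread(example)
```

A smaller spread means the per-domain thresholds sit closer together, so one global threshold loses less accuracy in any single domain.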