Few-Shot Detection of Machine-Generated Text using Style Representations

Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, Nicholas Andrews

TL;DR

The paper addresses the challenge of robustly detecting machine-generated text as large language models continue to evolve. It proposes a few-shot approach that leverages style representations learned from large human-authored corpora, enabling discrimination between human- and machine-written text without requiring samples from the models of concern at training time. The representations are trained contrastively on Reddit data and deployed in few-shot and multi-domain settings, where the method performs strongly against unseen LLMs and across topics. It remains robust to paraphrasing when the paraphrasing model is included among the support samples. The work demonstrates practical detection at high specificity (low false-alarm rates) and provides extensive datasets and baselines. It has implications for spam, plagiarism, and content moderation, and offers a path toward rapid adaptation to new generative models.

Abstract

The advent of instruction-tuned language models that convincingly mimic human writing poses a significant risk of abuse. However, such abuse may be counteracted with the ability to detect whether a piece of text was composed by a language model rather than a human author. Some previous approaches to this problem have relied on supervised methods by training on corpora of confirmed human- and machine- written documents. Unfortunately, model under-specification poses an unavoidable challenge for neural network-based detectors, making them brittle in the face of data shifts, such as the release of newer language models producing still more fluent text than the models used to train the detectors. Other approaches require access to the models that may have generated a document in question, which is often impractical. In light of these challenges, we pursue a fundamentally different approach not relying on samples from language models of concern at training time. Instead, we propose to leverage representations of writing style estimated from human-authored text. Indeed, we find that features effective at distinguishing among human authors are also effective at distinguishing human from machine authors, including state-of-the-art large language models like Llama-2, ChatGPT, and GPT-4. Furthermore, given a handful of examples composed by each of several specific language models of interest, our approach affords the ability to predict which model generated a given document. The code and data to reproduce our experiments are available at https://github.com/LLNL/LUAR/tree/main/fewshot_iclr2024.
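
The few-shot idea described in the abstract can be illustrated with a minimal sketch. It assumes that each document has already been mapped to a fixed-length style embedding, and it uses nearest-prototype cosine similarity as the scoring rule. That rule, the function names `build_prototypes` and `score_query`, the model labels, and the toy 3-dimensional vectors are all assumptions made for illustration; the abstract does not specify the authors' scoring procedure, and a real system would obtain embeddings from a style encoder such as the one in the linked repository.

```python
import numpy as np

def unit(v):
    """L2-normalize a vector, or each row of a matrix."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def build_prototypes(support):
    """Map each model name to the normalized mean of its few support embeddings."""
    return {name: unit(unit(embs).mean(axis=0)) for name, embs in support.items()}

def score_query(query_emb, prototypes):
    """Return (closest_model, cosine_similarity) for a query embedding."""
    q = unit(query_emb)
    sims = {name: float(q @ proto) for name, proto in prototypes.items()}
    best = max(sims, key=sims.get)
    return best, sims[best]

# Toy stand-ins for style embeddings (hypothetical values, not real model output).
support = {
    "model_a": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "model_b": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
prototypes = build_prototypes(support)
best, sim = score_query([0.95, 0.05, 0.05], prototypes)
print(best, round(sim, 3))
```

Under these assumptions, the similarity to the closest prototype can serve as a detection score (higher means more machine-like), while the identity of that prototype gives a guess at which model wrote the document. A detection threshold would be chosen on human-written text to meet a target false-alarm rate.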

Paper Structure (27 sections, 5 figures, 13 tables)

Figures (5)

  • Figure 1: UMAP projections (McInnes et al., 2018) of semantic or stylistic representations of writing samples in the Reddit domain composed by human or machine authors. We use SBERT as a representative dense semantic embedding (Reimers and Gurevych, 2019) and UAR as a representative stylistic representation (Rivera-Soto et al., 2021). Each point shown is the result of embedding a document containing at most 128 subword tokens for a standard vocabulary of size 50K. Despite using prompts designed to elicit a variety of writing styles from the LLM, the stylistic representation separates human from machine authors and machine authors from one another significantly better than the semantic representation.
  • Figure 2: Detection performance as the number of documents $N$ comprising episodes varies.
  • Figure 3: Mean of pAUC as the proportion of documents paraphrased in each query varies. Paraphrasing reduces the detection rate across the low FPR range, but including the paraphrased LLM as a support sample (UAR Multi-LLM) mitigates the drop in performance.
  • Figure 4: ROC curves assessing supervised machine-text detection performance. Both the RoBERTa- and UAR-based detectors perform well in-distribution, but performance drops when evaluating on data generated by new LLMs, new topics, or new domains. The UAR-based detector is more robust to changes in the testing distribution.
  • Figure 5: pAUC of the proposed approach and the watermark detector as the number of tokens is varied for the Amazon dataset. The proposed approach is more robust to paraphrase attacks, and achieves results equal to or better than those of the watermark detector when the number of tokens is $\geq 48$.
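
Figures 3 and 5 report pAUC, the area under the ROC curve restricted to low false-positive rates. The sketch below shows one way such a quantity can be computed with NumPy. The cap `max_fpr=0.1` and the normalization (dividing by the cap so that a perfect detector scores 1.0) are illustrative assumptions; the captions above do not state the cap or normalization used in the paper.

```python
import numpy as np

def partial_auc(labels, scores, max_fpr=0.1):
    """Area under the ROC curve for FPR in [0, max_fpr], divided by max_fpr.

    labels: 1 for machine-written (positive), 0 for human-written (negative).
    scores: higher means more likely machine-written.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")
    labels, scores = labels[order], scores[order]

    # One ROC point per distinct score value, so tied scores move together.
    distinct = np.r_[np.nonzero(np.diff(scores))[0], labels.size - 1]
    tpr = np.r_[0.0, np.cumsum(labels)[distinct] / labels.sum()]
    fpr = np.r_[0.0, np.cumsum(~labels)[distinct] / (~labels).sum()]

    if max_fpr < 1.0:
        # Cut the curve at max_fpr, interpolating the TPR at the cut point.
        stop = np.searchsorted(fpr, max_fpr, side="right")
        tpr_cut = np.interp(max_fpr, fpr[stop - 1:stop + 1], tpr[stop - 1:stop + 1])
        fpr = np.r_[fpr[:stop], max_fpr]
        tpr = np.r_[tpr[:stop], tpr_cut]

    # Trapezoidal rule over the truncated curve.
    area = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return area / max_fpr

# Toy example: two machine-written and two human-written documents.
perfect = partial_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
inverted = partial_auc([0, 0, 1, 1], [0.9, 0.8, 0.2, 0.1])
print(perfect, inverted)
```

Restricting the area to a small FPR range emphasizes the operating region that matters when false alarms against human authors are costly, which is the high-specificity setting the TL;DR highlights.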