On the Possibilities of AI-Generated Text Detection

Souradip Chakraborty, Amrit Singh Bedi, Sicheng Zhu, Bang An, Dinesh Manocha, Furong Huang

TL;DR

The paper studies the problem of distinguishing AI-generated text from human text through an information-theoretic lens. It shows that, unless the human and machine text distributions are indistinguishable, detection is feasible by collecting multiple samples: the best performance is achieved by likelihood-ratio detectors, and the upper bound on AUROC approaches 1 exponentially fast in the sample size, at a rate governed by the Chernoff information. The paper provides explicit sample-complexity bounds for both the i.i.d. and non-i.i.d. settings and validates the theory with experiments on multiple datasets and generator/detector pairs. The results support practical multi-sample detectors as a robust tool to mitigate misuse of LLMs, while acknowledging the challenges posed by paraphrasing and by distributional proximity. Overall, the work lays a theoretical and empirical foundation for multi-sample AI-generated text detection and informs detector and watermark design for real-world deployment.

Abstract

Our work addresses the critical issue of distinguishing text generated by Large Language Models (LLMs) from human-produced text, a task essential for numerous applications. Despite ongoing debate about the feasibility of such differentiation, we present evidence supporting its consistent achievability, except when human and machine text distributions are indistinguishable across their entire support. Drawing from information theory, we argue that as machine-generated text approximates human-like quality, the sample size needed for detection increases. We establish precise sample complexity bounds for detecting AI-generated text, laying groundwork for future research aimed at developing advanced, multi-sample detectors. Our empirical evaluations across multiple datasets (XSum, SQuAD, IMDb, and Kaggle FakeNews) confirm the viability of enhanced detection methods. We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero. Our findings align with OpenAI's empirical data related to sequence length, marking the first theoretical substantiation for these observations.

Paper Structure (21 sections, 5 theorems, 46 equations, 10 figures)

This paper contains 21 sections, 5 theorems, 46 equations, 10 figures.

Key Result

Proposition 1

For any detector $D$, given a collection of i.i.d. samples $S:=\{s_i\}_{i=1}^n$ drawn either from the human distribution $h(s)$ or from the machine distribution $m(s)$, it holds that

$$\texttt{AUROC}_D \leq \frac{1}{2} + \texttt{TV}(m^{\otimes n}, h^{\otimes n}) - \frac{\texttt{TV}(m^{\otimes n}, h^{\otimes n})^2}{2},$$

where $\texttt{TV}(m^{\otimes n}, h^{\otimes n}) := 1 -\exp\left(-n I_{c} (m, h) + o(n)\right)$ and $I_{c} (m, h)$ is the Chernoff information. Therefore, the upper bound on the AUROC approaches $1$ exponentially fast in the number of samples $n$.
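The following is a minimal numerical sketch of how such a bound behaves, not the authors' code. It assumes the bound takes the form $\texttt{AUROC} \leq \frac{1}{2} + \texttt{TV} - \frac{\texttt{TV}^2}{2}$, which reproduces the value of about $0.6$ at $\texttt{TV}=0.1$ quoted in the Figure 1 caption, and it drops the $o(n)$ term, so the numbers are only indicative for large $n$. The two-symbol distributions `m` and `h` and all function names are hypothetical choices made for illustration.

```python
import math

def chernoff_information(m, h, grid=2001):
    """Chernoff information of two discrete distributions (lists of probabilities).

    I_c(m, h) = -min over lam in [0, 1] of log sum_s m(s)^lam * h(s)^(1 - lam),
    approximated here by a simple grid search over lam.
    """
    best = float("inf")
    for k in range(grid):
        lam = k / (grid - 1)
        total = sum((p ** lam) * (q ** (1.0 - lam))
                    for p, q in zip(m, h) if p > 0 and q > 0)
        best = min(best, math.log(total))
    return -best

def tv_product(n, i_c):
    """Asymptotic form 1 - exp(-n * I_c) for the TV distance between n-fold products.

    Assumption: the o(n) term is ignored, so this is not accurate for small n.
    """
    return 1.0 - math.exp(-n * i_c)

def auroc_upper_bound(tv):
    """Assumed bound: AUROC <= 1/2 + TV - TV^2 / 2."""
    return 0.5 + tv - tv ** 2 / 2.0

if __name__ == "__main__":
    h = [0.5, 0.5]  # hypothetical "human" distribution over two symbols
    m = [0.6, 0.4]  # hypothetical "machine" distribution, TV(m, h) = 0.1
    i_c = chernoff_information(m, h)
    for n in (1, 10, 100, 1000):
        print(n, auroc_upper_bound(tv_product(n, i_c)))
```

Because the bound is increasing in the total variation distance, and the asymptotic form of that distance is increasing in $n$, the printed values grow toward $1$ as more samples are pooled.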

Figures (10)

  • Figure 1: In light of the sample complexity bound presented in Theorem 1, we show pictorially how increasing the number of samples $n$ used for detection affects the ROC of the best possible detector, which is achieved by the likelihood-ratio-based classifier. In the ROC curve on the left, for $\texttt{TV}(m,h)=0.1$, the AUROC of the best possible detector is $0.6$, as derived in Sadasivan et al. (2023) (shown by an orange dot in the right figure). An AUROC of $0.6$ would lead to the conclusion that detection is hard. In contrast, as the number of samples $n$ increases, the ROC upper bound rises towards $1$ exponentially fast (shown by the shaded blue region in the left figure for different values of $n$), and hence the AUROC of the best possible detector also increases, as shown by the corresponding blue dots in the right figure. Detection should therefore be possible even in hard scenarios where the distance $\texttt{TV}(m,h)$ is small.
  • Figure 2: Panels (a)-(c) validate our theorem on real human-machine classification datasets generated with XSum (Narayan et al., 2018) and SQuAD (Rajpurkar et al., 2016), showing that detection performance improves significantly as the number of samples or the sequence length increases. Figure 2(a) shows that the AUROC achieved by the best possible detector using the equation increases significantly, from $58\%$ to $97\%$, as the n-gram order of the feature space increases, for both the XSum and SQuAD datasets. Figure 2(b) demonstrates the improvement in AUROC with respect to sequence length for various real detectors/classifiers. Figure 2(c) shows, using a box-plot-based comparison, that if we use two i.i.d. sequences (both from either the machine or the human) for detection instead of one, the AUROC of the real detector improves drastically from $73\%$ to $97\%$, validating our hypothesis.
  • Figure 3: Panels (a)-(f) validate our theorem on real human-machine classification datasets generated with XSum and SQuAD, using zero-shot detection performance. We use different generator/detector pairs to compare performance. For instance, (a) shows the detection performance (AUROC) of OpenAI's RoBERTa detector (Large) on text generated by GPT-3.5-Turbo, and (b)-(f) extend this to other pairs. We observe that, as the number of samples or the sequence length used for detection increases, the zero-shot detection performance of both models improves from around 50% to 90% on both the XSum and SQuAD human-machine datasets. We also performed similar experiments with GPT-2; the results are shown in an additional figure in the appendix.
  • Figure 4: This figure demonstrates zero-shot detection performance with and without paraphrasing using RoBERTa-Large-Detector. Although the detection performance drops by approximately 15% due to paraphrasing, the trend of performance improvement holds as the sequence length increases.
  • Figure 5: We present the two detectability regimes for LLMs. Figure 5(a) shows the regime in which the LLM learns a distribution that differs from the human one, so detection is easy. Figure 5(b) shows the regime in which the LLM's distribution is very close to the human one; detection is hard but still possible by collecting more samples. In the regime of Figure 5(b), efficient watermarking techniques (Kirchenbauer et al., 2023; Krishna et al., 2023) could additionally help improve separability and detectability.
  • ...and 5 more figures
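
The trend described in these captions, AUROC improving as more samples are pooled, can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not a reproduction of the paper's experiments: the two-symbol distributions `m` and `h` stand in for machine and human text, the detector is an exact log-likelihood-ratio score (which requires knowing both distributions), and all function names are hypothetical.

```python
import math
import random

def log_likelihood_ratio(samples, m, h):
    """Sum of log m(s)/h(s) over the samples; larger means more machine-like.

    Computed from per-symbol counts so equal counts give exactly equal scores.
    """
    counts = [0] * len(m)
    for s in samples:
        counts[s] += 1
    return sum(c * math.log(m[s] / h[s]) for s, c in enumerate(counts))

def draw(dist, n, rng):
    """Draw n i.i.d. symbol indices from a discrete distribution."""
    return rng.choices(range(len(dist)), weights=dist, k=n)

def empirical_auroc(pos, neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auroc_for_n(n, m, h, trials=400, seed=0):
    """Empirical AUROC of the likelihood-ratio score when each decision uses n samples."""
    rng = random.Random(seed)
    pos = [log_likelihood_ratio(draw(m, n, rng), m, h) for _ in range(trials)]
    neg = [log_likelihood_ratio(draw(h, n, rng), m, h) for _ in range(trials)]
    return empirical_auroc(pos, neg)

if __name__ == "__main__":
    h = [0.5, 0.5]  # hypothetical "human" distribution
    m = [0.6, 0.4]  # hypothetical "machine" distribution, TV(m, h) = 0.1
    for n in (1, 10, 100):
        print(n, round(auroc_for_n(n, m, h), 3))
```

With a single sample the score is close to chance, because the two distributions overlap heavily; pooling many i.i.d. samples separates the score distributions, which is the qualitative effect the figures report for real detectors.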

Theorems & Definitions (6)

  • Proposition 1: Area Under ROC Curve
  • Theorem 1: Sample Complexity of AI-generated Text Detection (Possibility Result)
  • Theorem 2: Sample Complexity of AI-generated Text Detection (non-iid)
  • Lemma 1: Le Cam's Lemma
  • proof
  • Lemma 2: Upper Bound for Non-iid scenario