An Information-Theoretic Approach for Detecting Edits in AI-Generated Text

Idan Kashtan; Alon Kipnis

TL;DR

This work addresses the challenge of detecting edits in AI-generated text when edits are sparse and their locations are unknown. It introduces a two-step method that first tests each sentence with a log-perplexity-based statistic and then aggregates the evidence globally using Higher Criticism (HC) to detect deviations from purely generative-language-model (GLM) authorship; the same procedure can also identify candidate edited sentences. The paper provides an information-theoretic analysis of the method's optimality under sparse mixture models and cross-entropy principles, supported by extensive empirical evaluations on synthetic and realistic datasets, including Wikipedia-style and news articles. It concludes with practical refinements, context-based extensions, and open theoretical questions about optimal detector design and the minimal number of edits required for reliable detection, highlighting the approach's significance for the trust, transparency, and governance of GLM-generated content.
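The per-sentence statistic in the first step is a log-perplexity. The sketch below assumes the usual definition: the average negative log-probability per token. It also assumes the token log-probabilities have already been obtained from a reference language model such as GPT2. The function name `log_perplexity` and the toy probabilities are hypothetical illustrations, not the authors' code.

```python
import math
from typing import Sequence


def log_perplexity(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood per token (natural log).

    Assumption: token_logprobs[t] = log p(x_t | x_<t) under a reference
    language model (e.g. GPT2), computed elsewhere.
    """
    if len(token_logprobs) == 0:
        raise ValueError("sentence has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical token log-probabilities for two three-token sentences.
likely = [math.log(0.5), math.log(0.4), math.log(0.6)]
unlikely = [math.log(0.05), math.log(0.1), math.log(0.02)]
print(log_perplexity(likely) < log_perplexity(unlikely))  # True
```

Sentences the model finds less predictable receive a larger log-perplexity. This is the quantity whose per-sentence P-values are then combined in the second step.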

Abstract

We propose a method to determine whether a given article was written entirely by a generative language model or perhaps contains edits by a different author, possibly a human. Our process involves multiple tests for the origin of individual sentences or other pieces of text, combined using a method that is sensitive to rare alternatives, i.e., alternatives in which non-null effects are few and scattered across the text in unknown locations. Interestingly, this method also identifies pieces of text suspected to contain edits. We demonstrate the effectiveness of the method in detecting edits through extensive evaluations using real data and provide an information-theoretic analysis of the factors affecting its success. In particular, we discuss optimality properties under a theoretical framework for text editing in which sentences are generated mainly by the language model, except perhaps for a few sentences that might have originated via a different mechanism. Our analysis raises several interesting research questions at the intersection of information theory and data science.
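The global test can be sketched with the Higher Criticism statistic of Donoho and Jin. This is an illustration under stated assumptions, not the authors' implementation. It uses one common form of HC and restricts the maximization to the smallest fraction `gamma` of the P-values, which is an assumed tuning choice. It calibrates the critical value by simulating HC under independent uniform P-values, the calibration mentioned in the Figure 2 and Figure 4 captions. Sentences with P-values at or below the HC threshold (the P-value at which HC peaks) are treated as suspected edits. All function names and the toy P-values are hypothetical.

```python
import numpy as np


def higher_criticism(pvals, gamma=0.4):
    """Return (HC score, HC threshold) for a 1-D collection of P-values.

    Donoho-Jin form: HC_i = sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    maximized over the smallest gamma * n sorted P-values.
    """
    # Clip so that P-values of exactly 0 or 1 do not cause division by zero.
    p = np.sort(np.clip(np.asarray(pvals, dtype=float), 1e-12, 1 - 1e-12))
    n = p.size
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(gamma * n))
    i_star = int(np.argmax(hc[:k]))
    return float(hc[i_star]), float(p[i_star])


def hc_critical_value(n, alpha=0.05, gamma=0.4, n_sims=10_000, seed=0):
    """Monte Carlo (1 - alpha) quantile of HC under n i.i.d. Uniform(0, 1) P-values."""
    rng = np.random.default_rng(seed)
    sims = [higher_criticism(rng.uniform(size=n), gamma)[0] for _ in range(n_sims)]
    return float(np.quantile(sims, 1 - alpha))


def detect_edits(pvals, alpha=0.05, gamma=0.4):
    """Flag the article if HC exceeds its simulated null critical value.

    Sentences whose P-value is at or below the HC threshold are returned
    as suspected edits.
    """
    hc, threshold = higher_criticism(pvals, gamma)
    flagged = hc > hc_critical_value(len(pvals), alpha, gamma)
    suspected = [j for j, pj in enumerate(pvals) if pj <= threshold]
    return bool(flagged), suspected


# Toy example: 30 sentence P-values, three of them unusually small.
pvals = np.linspace(0.04, 0.98, 30)
pvals[[5, 12, 20]] = [0.003, 0.001, 0.002]
flagged, suspected = detect_edits(pvals)
print(flagged, suspected)  # True [5, 12, 20]
```

The simulated critical value depends only on the number of sentences `n` (and on `alpha` and `gamma`). In practice it can be computed once per article length and cached.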

Paper Structure (27 sections, 21 equations, 8 figures, 2 tables, 1 algorithm)

Introduction
Background and Motivation
Existing Approaches
Our Approach
Contributions and Paper Organization
Method Description
Testing individual sentences
Global Testing using Higher Criticism (HC)
Identifying edited sentences
Refinements and Generalizations
Adjusting the log-perplexity distribution for sentence's length
Unusually short and long sentences
Generalizing Step I: Testing pieces of text individually
Generalizing Step II: Inference based on multiple testing
Empirical Results
...and 12 more sections

Figures (8)

Figure 1: Left: The GLM ChatGPT is sequentially prompted to generate sections of a Wikipedia-style article titled Welsh Corgi. Right: The composition of the generated text with section titles leads to a so-called GLM-written article. The human editor alters the article in some places. We are interested in detecting the presence of edits if they exist, and their locations.
Figure 2: The detection procedure applied to the example article Welsh Corgi written by the GLM GPT-3.5-turbo. Left (table): log-perplexity and its P-value for each sentence; actual non-GLM sentences are in blue. Right: the Higher Criticism (HC) score is compared to a threshold, e.g., the $0.95$ quantile of HC under independent and uniformly distributed P-values, or a threshold calibrated via training data. Here log-perplexity is computed under the language model GPT2, and the P-values are based on the empirical distribution of sentences from Wikipedia-style articles written by the same GLM.
Figure 3: Discriminating GLM from non-GLM sentences using the log-perplexity (LPPT) statistic. Left: histogram by class of the LPPT of sentences from the dataset News Articles [isarth_2023] (top) and Wikipedia Introductions [aaditya_bhat_2023] (bottom). Right: the receiver operating characteristic (ROC) of a test based on the LPPT, with the area under the ROC curve (AUC) indicated. In both cases, LPPT is computed under the language model GPT2 (1.5B).
Figure 4: Simulated critical values for a test of significance level $\alpha$ based on Higher Criticism of $n$ independent P-values. The number of samples in each configuration is $10,000$. Bootstrapped $0.95$ confidence intervals are indicated.
Figure 5: Adjusting the perplexity test for the number of tokens in a sentence. Left: averaged log-perplexity versus sentence length; the shaded area indicates 2 standard errors. Right: fitted log-perplexity survival functions of GPT2 for several lengths. Based on 20,000 samples from the dataset Wikipedia Introductions [aaditya_bhat_2023].
...and 3 more figures
