Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking

Liyan Xie, Muhammad Siddeek, Mohamed Seif, Andrea J. Goldsmith, Mengdi Wang

TL;DR

The paper tackles the problem of identifying post-generation edits to watermarked LLM outputs by introducing a combinatorial pattern-based watermarking scheme that partitions the vocabulary into subsets and deterministically enforces a cyclic pattern during generation. It provides both a global watermark detection statistic and lightweight local statistics to localize edits, formalizing the problem with clear metrics and evaluating on open-source LLMs across diverse edits. Key contributions include the formal problem definition, a concrete pattern-based watermarking framework (with AB and ACADBD patterns), analytical bounds for edit-detection false alarms, and extensive empirical validation showing strong local edit localization and competitive watermark detectability. The approach improves accountability and traceability in AI-generated content by enabling token-level localization of edits while maintaining reliable watermark verification, with practical implications for collaboration, publishing, and security contexts.

Abstract

Watermarking has become a key technique for proprietary language models, enabling the distinction between AI-generated and human-written text. However, in many real-world scenarios, LLM-generated content may undergo post-generation edits, such as human revisions or even spoofing attacks, making it critical to detect and localize such modifications. In this work, we introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. To this end, we propose a combinatorial pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark by enforcing a deterministic combinatorial pattern over these subsets during generation. We accompany the combinatorial watermark with a global statistic that can be used to detect the watermark. Furthermore, we design lightweight local statistics to flag and localize potential edits. We introduce two task-specific evaluation metrics, Type-I error rate and detection accuracy, and evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
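The embedding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the random partition, the function names (`partition_vocab`, `watermarked_sample`, `generate`), the toy uniform "model", and the convention that `delta=None` means hard watermarking are all assumptions made for illustration. The sketch assigns every token id to one subset, then at step $t$ favors the subset named by the cyclic pattern at position $t$, either by masking all other tokens (hard) or by adding $\delta$ to the favored logits (soft).

```python
import math
import random

def partition_vocab(vocab_size, pattern, seed=0):
    """Split token ids into disjoint, near-equal subsets, one per pattern symbol."""
    labels = sorted(set(pattern))          # "AB" -> 2 subsets, "ACADBD" -> 4 subsets
    ids = list(range(vocab_size))
    random.Random(seed).shuffle(ids)
    return {tok: labels[i % len(labels)] for i, tok in enumerate(ids)}

def watermarked_sample(logits, token_label, expected, delta, rng):
    """Sample one token id while favoring the subset `expected`.

    delta=None -> hard watermark: tokens outside the subset get probability 0.
    delta=float -> soft watermark: logits inside the subset are raised by delta.
    """
    scores = []
    for tok, logit in enumerate(logits):
        in_subset = token_label[tok] == expected
        if delta is None:
            scores.append(logit if in_subset else -math.inf)
        else:
            scores.append(logit + delta if in_subset else logit)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]   # numerically stable softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits, length, token_label, pattern="AB", delta=None, seed=1):
    """Generate `length` tokens; step t targets subset pattern[t mod len(pattern)]."""
    rng = random.Random(seed)
    tokens = []
    for t in range(length):
        expected = pattern[t % len(pattern)]
        logits = next_logits(tokens)       # stand-in for a real LLM forward pass
        tokens.append(watermarked_sample(logits, token_label, expected, delta, rng))
    return tokens

# Toy usage: a "model" with uniform logits over a 20-token vocabulary.
labels = partition_vocab(20, "AB")
tokens = generate(lambda prefix: [0.0] * 20, 8, labels, pattern="AB", delta=None)
print("".join(labels[t] for t in tokens))  # ABABABAB under the hard watermark
```

With a finite `delta` the pattern is only encouraged rather than guaranteed, so a detector would need to tolerate occasional misaligned tokens even in unedited text.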

Paper Structure

This paper contains 25 sections, 2 theorems, 23 equations, 8 figures, 1 table, 3 algorithms.

Key Result

Theorem 3.1

Assume that under a clean watermark, the pattern alignment probability for each window of size $w$ is $\mu_1^{(w)} := \mathbb{P}[I_w(t) = 1]$ for all $t$. When $\mu_1^{(w)} = 1$ (hard watermarking with strict pattern adherence), we have a bound on the Type-I error rate (probability of a false alarm) $\Pr[\,\cdots\,]$ …, where $\Delta^{(w)} := \sum_{i,j}\mathbb{E}[I_w(t-i) I_w(t-j)]$ is a constant that depends on $w$.

Figures (8)

  • Figure 1: Overview of the constructed dataset used for evaluation. (Left) Characteristics of the generated texts. Edits are uniformly distributed across three types--replacement, deletion, and insertion--and span lengths from one to six tokens. (Right) Examples of each edit type. For each example, we show the prompt, the watermarked LLM output, and the edited text. Edited spans are highlighted in yellow to illustrate the nature and location of edits.
  • Figure 2: A proof-of-concept illustration of combinatorial pattern-based watermarking for edit detection. Suppose a simple Green-Red alternating watermark pattern is embedded. We slide a window (of size two in this example) and check whether tokens within each window align with the expected pattern. A significant pattern violation indicates a potential post-generation edit.
  • Figure 3: Illustration of edit detection outcomes.
  • Figure 4: Four examples of edit detection statistics under the two combinatorial patterns. Each example shows the prompt text, the watermarked LLM-generated text, and the edited text. The detection threshold is marked in red, and detected edit spans are represented by red bars that fall below the threshold. We mark the true detection and missed detection in the plot, under a tolerance of $L=3$. The examples are generated using LLaMA-2-7b with watermarking strength $\delta =5.8$.
  • Figure 5: Edit detection accuracy under different edit lengths (1 to 6 tokens) and three edit types (insertion, replacement, and deletion) on OPT-1.3b (left) and Llama-2-7b (right). The watermarking strength parameter is $\delta=5.8$. In all cases we allow an evaluation tolerance of $L=3$ tokens.
  • ...and 3 more figures
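The sliding-window check described in the Figure 2 caption can be sketched as follows. This is an illustrative reading, not the paper's exact statistic: the phase-agnostic alignment test, the moving average over `span` windows, and the names `window_aligned`, `alignment_indicators`, and `flag_edits` are assumptions. Each token is represented by its subset label (here `G` for Green and `R` for Red), a window counts as aligned if it matches a contiguous slice of the cyclic pattern at some phase, and positions whose local score drops below a threshold are flagged as possible edits.

```python
def window_aligned(window, pattern):
    """True if `window` equals a contiguous slice of the cyclic pattern at some phase."""
    n = len(pattern)
    return any(
        all(window[k] == pattern[(phase + k) % n] for k in range(len(window)))
        for phase in range(n)
    )

def alignment_indicators(labels, pattern, w):
    """Indicator per window start t: 1 if labels[t:t+w] is consistent with the pattern."""
    return [
        int(window_aligned(labels[t:t + w], pattern))
        for t in range(len(labels) - w + 1)
    ]

def flag_edits(labels, pattern, w=2, span=1, threshold=1.0):
    """Flag window indices whose average indicator over the last `span` windows
    falls below `threshold` (assumed smoothing; span=1 flags each misaligned window)."""
    ind = alignment_indicators(labels, pattern, w)
    flagged = []
    for t in range(len(ind)):
        lo = max(0, t - span + 1)
        score = sum(ind[lo:t + 1]) / (t + 1 - lo)
        if score < threshold:
            flagged.append(t)
    return flagged

clean = list("GRGRGRGR")
replaced = list("GRGGGRGR")   # token 3 changed from R to G
deleted = list("GRGGRGR")     # token 3 removed from the clean sequence

print(flag_edits(clean, "GR"))     # []
print(flag_edits(replaced, "GR"))  # [2, 3]
print(flag_edits(deleted, "GR"))   # [2]
```

Because the alignment test accepts any phase of the pattern, an insertion or deletion that shifts the remaining tokens is flagged only near the edit itself rather than along the whole tail of the text.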

Theorems & Definitions (9)

  • Definition 1: Detection accuracy
  • Definition 2: Type-I error rate
  • Example 3.1: Alternating Binary Pattern
  • Example 3.2: Alternating Quaternary Pattern
  • Theorem 3.1: Type-I error rate of edit detection
  • Proof: Proof of Theorem 3.1
  • Theorem A.1: Watermark detection error rates
  • proof
  • Remark B.1
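The two evaluation metrics named in Definitions 1 and 2 are not spelled out in this excerpt, so the following is only a hedged sketch of how tolerance-based scoring could work. The matching rule (an edit counts as detected if any flagged position lies within $L$ tokens of the edited span), the per-text notion of a false alarm, and the function names are assumptions; only the tolerance $L=3$ comes from the figure captions.

```python
def edit_detected(flagged, edit_start, edit_end, tolerance=3):
    """True if some flagged position is within `tolerance` tokens of the edited span."""
    return any(edit_start - tolerance <= p <= edit_end + tolerance for p in flagged)

def detection_accuracy(cases, tolerance=3):
    """Fraction of edited texts whose edit is detected.

    cases: list of (flagged_positions, edit_start, edit_end), one per edited text.
    """
    hits = sum(edit_detected(f, s, e, tolerance) for f, s, e in cases)
    return hits / len(cases)

def type1_error_rate(clean_flags):
    """Fraction of unedited watermarked texts with at least one flagged position.

    clean_flags: list of flagged-position lists, one per clean text.
    """
    return sum(1 for f in clean_flags if f) / len(clean_flags)

# Hypothetical detector outputs, for illustration only.
edited_cases = [([2, 3], 3, 3), ([], 5, 6), ([10], 5, 6)]
print(detection_accuracy(edited_cases, tolerance=3))  # one of three edits is matched
print(type1_error_rate([[], [], [4], []]))            # 0.25
```

A larger tolerance makes detection accuracy easier to achieve but localizes edits less precisely, so the two quantities are best reported together with the value of $L$ used.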