Provable Robust Watermarking for AI-Generated Text

Xuandong Zhao, Prabhanjan Ananth, Lei Li, Yu-Xiang Wang

TL;DR

This paper formalizes provable robustness for watermarking AI-generated text by introducing Unigram-Watermark, a $K=1$ watermark with a fixed green-red vocabulary split that modulates logits to embed a detectable pattern. It establishes a rigorous framework with generation-quality guarantees, Type I/II error bounds, and security against post-processing, including a robustness bound showing that the number of edits the watermark tolerates scales with text length. The authors prove that the watermarked distribution is statistically close to the unwatermarked one and that detection error probabilities decay exponentially as text length grows, under reasonable entropy and homophily assumptions. Empirical results across multiple LLMs and datasets demonstrate superior detection accuracy and robustness to paraphrasing and editing while preserving text quality, along with the ability to distinguish human-written text. The work provides a practical, theoretically grounded approach to enforcing accountability in AI-generated content and outlines directions for extending to broader $K$-gram watermarks and cryptographic hybrids.

Abstract

We study the problem of watermarking text generated by large language models (LLMs), one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality and correctness in watermark detection, and that it is robust against text editing and paraphrasing. Experiments on three different LLMs and two datasets verify that our Unigram-Watermark achieves superior detection accuracy and comparable generation quality in perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark.
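The mechanism summarized above (a fixed green-red vocabulary split, a logit boost for green tokens, and detection by counting green tokens) can be illustrated with a minimal sketch. This is not the authors' released implementation. The function names, the green-list fraction `gamma`, the integer `key`, and the toy uniform "language model" are all assumptions introduced for illustration. The detection statistic is the standard one-proportion $z$-score, which is also assumed here rather than taken from the text above.

```python
import math
import random


def make_green_list(vocab_size, gamma, key):
    """Fixed split: a keyed shuffle selects int(gamma * vocab_size) green token ids."""
    rng = random.Random(key)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return frozenset(ids[: int(gamma * vocab_size)])


def watermark_logits(logits, green, delta):
    """Add delta to every green-token logit; red-token logits are unchanged."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def z_score(tokens, green, gamma):
    """Standardized green-token count; near 0 in expectation without a watermark."""
    n = len(tokens)
    hits = sum(1 for t in tokens if t in green)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))


# Toy demo: a "model" with uniform logits stands in for a real LLM (assumption).
vocab_size, gamma, delta = 1000, 0.5, 2.0
green = make_green_list(vocab_size, gamma, key=0)
rng = random.Random(1)

probs = softmax(watermark_logits([0.0] * vocab_size, green, delta))
marked = rng.choices(range(vocab_size), weights=probs, k=200)  # watermarked sample
plain = rng.choices(range(vocab_size), k=200)                  # unwatermarked sample

print("z (watermarked):  ", round(z_score(marked, green, gamma), 2))
print("z (unwatermarked):", round(z_score(plain, green, gamma), 2))
```

Because the green list does not depend on preceding tokens, replacing or reordering tokens changes the green count by at most one per edited token. That is the intuition behind the edit-robustness claim. A detector would flag text whose $z$-score exceeds a chosen threshold.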

Paper Structure

This paper contains 44 sections, 21 theorems, 56 equations, 5 figures, 8 tables, 4 algorithms.

Key Result

Theorem 3.1

Consider $\boldsymbol{h}$ as the input to the language model at step $t$, denoted as $\boldsymbol{h} = [\boldsymbol{x}, \boldsymbol{y}_{1:t-1}]$. Fix green list $G$. Let $\delta$ represent the watermark strength. For any $\boldsymbol{h}$, the $\alpha$-th order Renyi-divergence between the watermarke

Figures (5)

  • Figure 1: $z$-score comparison and text perplexity comparison.
  • Figure 2: ROC curves with corresponding AUC values for watermark detection against various attack methods. Complete results can be found in the Appendix.
  • Figure 3: Distinguishing human-written text on TOEFL dataset.
  • Figure 4: ROC curves with corresponding AUC values for watermark detection against various attack methods.
  • Figure 5: Empirical vs. theoretical false positive rates across various $\alpha$ values, using multiple green list initializations.

Theorems & Definitions (48)

  • Definition 2.1: Edit distance
  • Definition 2.2: Language model watermarking
  • Remark 2.3: Discussion on Definition 2.2
  • Theorem 3.1
  • Remark 3.2: KL-divergence and other probability distance metrics
  • Theorem 3.3: No false positives (short version of the full theorem)
  • Remark 3.4: Controlling false positive rate
  • Theorem 3.5: Only true positive (informal version of the full theorem)
  • Remark 3.6
  • Theorem 3.7: Robustness to editing
  • ...and 38 more