Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach

Haiyun He, Yepeng Liu, Ziqiao Wang, Yongyi Mao, Yuheng Bu

TL;DR

This work addresses the need for principled, in-process watermarking of LLM outputs by jointly optimizing the watermarking scheme and detector under distortion and ultra-low FPR constraints. It derives universal optimality results showing that watermarking schemes should adapt to the LLM's generative distribution and introduces a distortion-free, distribution-adaptive approach (DAWA) that relies on a surrogate model and Gumbel-Max sampling. The token-level design translates theory into a practical algorithm with provable robustness to token alterations, while DAWA demonstrates superior detection performance and preserved text quality across large models (e.g., Llama2-13B and Mistral-8×7B) and datasets. The work also provides a pathway to extend the framework to stronger robustness against semantic-based attacks, signaling a meaningful advance for AI safety, accountability, and IP protection in real-world deployments.

Abstract

Watermarking has emerged as a crucial method to distinguish AI-generated text from human-created text. Current watermarking approaches often lack formal optimality guarantees or address the scheme and detector design separately. In this paper, we introduce a novel, unified theoretical framework for watermarking Large Language Models (LLMs) that jointly optimizes both the watermarking scheme and detector. Our approach aims to maximize detection performance while maintaining control over the worst-case false positive rate (FPR) and distortion on text quality. We derive closed-form optimal solutions for this joint design and characterize the fundamental trade-off between watermark detectability and distortion. Notably, we reveal that the optimal watermarking schemes should be adaptive to the LLM's generative distribution. Building on our theoretical insights, we propose a distortion-free, distribution-adaptive watermarking algorithm (DAWA) that leverages a surrogate model for model-agnosticism and efficiency. Experiments on Llama2-13B and Mistral-8$\times$7B models confirm the effectiveness of our approach, particularly at ultra-low FPRs. Our code is available at https://github.com/yepengliu/DAWA.

Paper Structure

This paper contains 37 sections, 7 theorems, 58 equations, 6 figures, 9 tables, 2 algorithms.

Key Result

Theorem 1

The universally minimum Type-II error attained from Eq. (Opt-O) is [...], which is achieved by the watermarked distribution [...]. By setting $\mathsf{D}$ as the total variation distance $\mathsf{D}_\mathsf{TV}$, Eq. (Type-II LB) can be simplified as follows: [...]

Figures (6)

  • Figure 1: Comparison of TPR at ultra-low FPR among different watermarking methods.
  • Figure 2: Overview of LLM watermarking and detection.
  • Figure 3: Illustration of error--distortion trade-off.
  • Figure 4: Workflow of our practical algorithm (DAWA) for watermark generation and detection. (A1): construct the sampling distribution of the auxiliary variable $\zeta_t$ based on $Q_{x_t|x_1^{t-1},\mathrm{pt}}$; (A2): sample $\zeta_t$ using the Gumbel-Max trick and a shared key; (A3): adjust the next-token prediction (NTP) distribution of $x_t$ with $\eta$.
  • Figure 5: A toy example of the optimal detector and watermarking scheme when $T=1$. Links between $\mathcal{V}$ and $\mathcal{Z}$ suggest $P_{X_1,\zeta_1}^*> 0$.
  • ...and 1 more figure
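
The Figure 4 caption mentions sampling with the Gumbel-Max trick and a shared key. The sketch below illustrates only that generic idea: drawing an index from a categorical distribution using Gumbel noise that is reproducible from a key. It is not DAWA's construction of $\zeta_t$. The function names (`keyed_rng`, `gumbel_max_sample`), the SHA-256 seeding, and the use of preceding token ids as context are all assumptions made for illustration.

```python
import hashlib
import math
import random
import sys


def keyed_rng(shared_key: str, context: tuple) -> random.Random:
    """Hypothetical helper: derive a reproducible RNG from a key and a context."""
    material = shared_key + "|" + ",".join(str(c) for c in context)
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def gumbel_max_sample(probs, shared_key: str, context: tuple) -> int:
    """Return argmax_i (log p_i + g_i), where g_i is keyed Gumbel(0, 1) noise.

    With independent uniform noise this draws index i with probability p_i,
    so the sampled distribution matches `probs`.
    """
    rng = keyed_rng(shared_key, context)
    best_index, best_score = -1, -math.inf
    for i, p in enumerate(probs):
        # Draw for every index so the noise stream stays aligned with positions.
        u = max(rng.random(), sys.float_info.min)  # avoid log(0)
        if p <= 0.0:
            continue  # zero-probability entries can never win
        gumbel = -math.log(-math.log(u))
        score = math.log(p) + gumbel
        if score > best_score:
            best_index, best_score = i, score
    return best_index


if __name__ == "__main__":
    probs = [0.5, 0.3, 0.2]
    # Same key and context reproduce the same draw.
    print(gumbel_max_sample(probs, "secret", (17, 42)))
    print(gumbel_max_sample(probs, "secret", (17, 42)))
```

Because the noise depends only on the key and the context, anyone holding the same key can regenerate it for a given context, which is what makes key-based detection possible in schemes of this kind.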

Theorems & Definitions (12)

  • Definition 1: $\epsilon$-distorted watermarking scheme
  • Example 1: Existing watermarking schemes as special cases
  • Theorem 1: Universally minimum Type-II error
  • Theorem 2: (Informal Statement) Jointly optimal watermarking schemes and detectors
  • Example 2: Examples of heuristic detectors
  • Lemma 3: (Informal Statement) Token-level optimal watermarking detection errors
  • Proposition 4: Robustness against token replacement
  • proof
  • Theorem 5: Theorem 10 of Janson (1998)
  • Theorem 6: Universally minimum $f$-robust Type-II error
  • ...and 2 more