A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models

Yihan Wu; Zhengmian Hu; Junfeng Guo; Hongyang Zhang; Heng Huang

TL;DR

This work presents DiPmark, a watermarking framework for large language models that preserves the original token distribution while supporting efficient detection without access to the model API and provable resilience to moderate text modifications. It achieves distribution preservation by combining a distribution-preserving reweighting scheme with i.i.d. cipher permutations, and it detects watermarks with a fixed statistic based on green-token bias that comes with guaranteed false-positive control. The approach is validated empirically on machine translation and text summarization tasks, shows fast and scalable detection, and demonstrates robustness, including a GPT-4 case study. Together, these results give a practical, theory-backed method for watermarking AI-generated text in deployment and detection workflows.

Abstract

Watermarking techniques offer a promising way to identify machine-generated content by embedding covert information into the content generated by language models. A challenge in this domain lies in preserving the distribution of the original generated content after watermarking. Our research extends and improves upon existing watermarking frameworks, placing emphasis on the importance of a Distribution-Preserving (DiP) watermark. In contrast to current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking (distribution-preserving), is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens (resilient). DiPmark operates by selecting a random set of tokens prior to the generation of a word, then modifying the token distribution through a distribution-preserving reweight function to enhance the probability of these selected tokens during the sampling process. Extensive empirical evaluation on various language models and tasks demonstrates our approach's distribution-preserving property, accessibility, and resilience, making it an effective solution for watermarking tasks that demand impeccable quality preservation.

Paper Structure (27 sections, 5 theorems, 24 equations, 18 figures, 9 tables, 2 algorithms)


Introduction
Related Work
Preliminary
DiPmark
DiPmark Detection
DiPmark is Provably Resilient Against Text Modification
Experiments
Distribution-preserving Property
Accessibility
Resilience and provable resilience
Ablation study: watermark detectability
Case study: watermarking GPT-4 by DiPmark
Conclusion
Future Work
Related Work
...and 12 more sections

Key Result

Theorem 4.3

DiP-reweight is a distribution-preserving reweight strategy, i.e., for all $\bm{x}_{1:n} \in \mathcal{V}$ and any $i \leq n$, it holds that $P_M(x_i|\bm{x}_{1:i-1}) = \mathbb{E}_{\theta_i\sim P_\Theta}[P_W(x_i|\bm{x}_{1:i-1},\theta_i)].$
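
The identity in Theorem 4.3 can be checked numerically on a toy vocabulary. The sketch below is not the authors' implementation. It assumes the DiP-reweight takes the cumulative form $F(x) = \max(x-\alpha, 0) + \max(x-(1-\alpha), 0)$ applied along a permutation of the tokens, which matches the Figure 1 description of moving the mass in $[0,\alpha]$ onto $[1-\alpha,1]$, and it assumes $P_\Theta$ is uniform over all permutations. The function names and the example distribution are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def dip_reweight(probs, perm, alpha):
    """Reweight `probs` along one token permutation `perm` (assumed form).

    Tokens are laid out on [0, 1] in the order given by `perm`; the mass in
    [0, alpha] is dropped and the mass in [1 - alpha, 1] is doubled.
    """
    def F(x):
        # Assumed cumulative form of the DiP-reweight.
        return max(x - alpha, 0) + max(x - (1 - alpha), 0)

    out = [Fraction(0)] * len(probs)
    cum = Fraction(0)
    for tok in perm:
        nxt = cum + probs[tok]
        out[tok] = F(nxt) - F(cum)
        cum = nxt
    return out


def expected_reweighted(probs, alpha):
    """Average the reweighted distribution over all permutations (uniform P_Theta)."""
    perms = list(permutations(range(len(probs))))
    total = [Fraction(0)] * len(probs)
    for perm in perms:
        rw = dip_reweight(probs, perm, alpha)
        total = [t + r for t, r in zip(total, rw)]
    return [t / len(perms) for t in total]


# Hypothetical 4-token distribution, kept exact with Fractions.
probs = [Fraction(1, 2), Fraction(3, 10), Fraction(3, 20), Fraction(1, 20)]
alpha = Fraction(3, 10)

one_key = dip_reweight(probs, (0, 1, 2, 3), alpha)
avg = expected_reweighted(probs, alpha)
print("single key  :", one_key)
print("expectation :", avg)
print("preserved   :", avg == probs)
```

A single key visibly distorts the distribution, while the average over keys recovers it. The reason is that each permutation pairs with its reverse, and under this form of $F$ the two reweighted probabilities of any token sum to twice its original probability.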

Figures (18)

Figure 1: Illustration of the $P_W^\alpha$-reweight and DiP-reweight. Top: In $P_W^\alpha$-reweight, the token probabilities within the interval $[0, \alpha]$ are adjusted to 0, while the rest are adjusted to 1. Bottom: In DiP-reweight, the probability mass within $[0,\alpha]$ is transferred to the probability mass within $[1-\alpha,1]$.
Figure 2: Empirical verification of the distribution-preserving property of DiPmark. Top: Violin plot of Machine Translation BLEU. Bottom: Violin plot of Text Summarization Perplexity. The Soft watermarks (Kirchenbauer et al., 2023) significantly degrade the text quality, while DiPmark preserves it.
Figure 3: Certified radius $\epsilon_0$ of DiPmark under text modification, with FPR smaller than 1%. The x-axis shows the certified radius and the y-axis shows the percentage of watermarked sequences that are resilient to any text modification attack with budget $\epsilon_0$.
Figure 4: Top: Average perplexity vs green token ratio with $\gamma=0.5$ on text generation tasks. Bottom: Average green token ratio with different $\gamma$.
Figure 5: Cumulative histogram of the number of green tokens in the 100 watermarked GPT-4-generated sequences. The green line represents the threshold with FPR smaller than 1%.
...and 13 more figures

Theorems & Definitions (16)

Definition 3.1: Reweight strategy
Definition 3.2: Distribution-preserving reweight strategy
Definition 3.3: Distribution-preserving watermark
Definition 4.1: $P_W^\alpha$-reweight strategy
Definition 4.2: DiP-reweight strategy
Theorem 4.3
Corollary 4.4
Definition 5.1: Green token ratio
Theorem 5.2: Concentration bound of $\Phi(\gamma,\bm{x}_{1:n})$
Definition 6.1: Certified radius
...and 6 more
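
Theorem 5.2 is listed here by name only, so its exact bound is not reproduced. As a rough illustration of how a concentration bound turns a green-token count into a detection threshold with a controlled false-positive rate, the sketch below substitutes a generic Hoeffding inequality for the paper's bound. It assumes that in unwatermarked text each token is green independently with probability `q`; `q`, `n`, and both function names are hypothetical and not taken from the paper.

```python
import math


def hoeffding_fpr_bound(n, green_count, q):
    """Upper bound on P(count >= green_count) for unwatermarked text.

    Uses Hoeffding's inequality as a stand-in, not the paper's Theorem 5.2.
    Assumes n independent tokens, each green with probability q.
    """
    excess = green_count / n - q
    if excess <= 0:
        return 1.0  # the bound is vacuous at or below the null mean
    return math.exp(-2.0 * n * excess * excess)


def min_green_count_for_fpr(n, q, fpr):
    """Smallest green-token count whose Hoeffding bound is at most `fpr`.

    The result can exceed n for short texts, meaning no count is sufficient.
    """
    t = math.sqrt(math.log(1.0 / fpr) / (2.0 * n))
    return math.ceil(n * (q + t))


n, q, fpr = 200, 0.5, 0.01
threshold = min_green_count_for_fpr(n, q, fpr)
print("flag as watermarked if green tokens >=", threshold)
print("bound at threshold:", hoeffding_fpr_bound(n, threshold, q))
```

Because the threshold depends only on the sequence length, the green-token probability, and the target false-positive rate, a detector of this kind needs neither the language model nor the prompt, which is the sense of "accessible" used in the abstract.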
