Distortion-free Watermarks are not Truly Distortion-free under Watermark Key Collisions

Yihan Wu; Ruibo Chen; Zhengmian Hu; Yanshuo Chen; Junfeng Guo; Hongyang Zhang; Heng Huang

TL;DR

The paper analyzes distortion-free language-model watermarks under finite-key collisions, proving that strong distortion-free preservation cannot hold when key collisions occur. It formalizes the PDA-rule framework, demonstrates the inevitability of key collisions, and characterizes three levels of distortion-free behavior. To address distribution bias from collisions, it introduces the beta-watermark, a beta-weighted, model-agnostic extension of permutation-based reweighting that reduces bias while preserving detectability. Extensive experiments on MBart, BART-large, and LLaMA-2 across translation, summarization, and generation show beta-watermark can meaningfully lower distribution bias with controllable trade-offs in watermark strength and detection efficiency, suggesting a practical path forward for distortion-aware watermarking. The work highlights the need for novel key-space designs to approach true distortion-free guarantees under realistic constraints and motivates further study of robust, scalable watermarking and detection mechanisms.

Abstract

Language model (LM) watermarking techniques inject a statistical signal into LM-generated content by substituting the random sampling process with pseudo-random sampling, using watermark keys as the random seed. Among these statistical watermarking approaches, distortion-free watermarks are particularly crucial because they embed watermarks into LM-generated content without compromising generation quality. However, one notable limitation of pseudo-random sampling compared to true-random sampling is that, under the same watermark keys (i.e., key collision), the results of pseudo-random sampling are correlated. This limitation could undermine the distortion-free property. Our studies reveal that key collisions are inevitable because the supply of watermark keys is limited, and that existing distortion-free watermarks exhibit a significant distribution bias relative to the original LM distribution in the presence of key collisions. Moreover, achieving a perfect distortion-free watermark is impossible, as no statistical signal can be embedded under key collisions. To reduce the distribution bias caused by key collisions, we introduce a new family of distortion-free watermarks, the beta-watermark. Experimental results support that the beta-watermark can effectively reduce the distribution bias under key collisions.
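The effect of a key collision on pseudo-random sampling can be illustrated with a small sketch. It is not the paper's implementation: it assumes a toy three-token distribution, uses the Gumbel-max trick as a stand-in for a Gumbel-reparametrization watermark, and the names `gumbel_sample`, `probs`, `fresh_freq`, and `collided_freq` are hypothetical. With a fresh key per draw, the empirical token frequencies track the original distribution; with one reused key, every draw returns the same token.

```python
import math
import random
from collections import Counter


def gumbel_sample(probs, key):
    """Pick a token index with the Gumbel-max trick, seeding the noise with `key`."""
    rng = random.Random(key)  # the watermark key plays the role of the random seed
    best, best_score = 0, -math.inf
    for i, p in enumerate(probs):
        u = rng.random()
        if p <= 0.0 or u <= 0.0:
            continue  # skip zero-probability tokens and the degenerate draw u == 0
        score = math.log(p) - math.log(-math.log(u))  # log p_i + Gumbel(0, 1) noise
        if score > best_score:
            best, best_score = i, score
    return best


probs = [0.5, 0.3, 0.2]  # toy next-token distribution (assumption)
n = 20000

# Fresh key for every draw: sampling behaves like ordinary random sampling.
master = random.Random(0)
fresh = Counter(gumbel_sample(probs, master.getrandbits(64)) for _ in range(n))
fresh_freq = [fresh[i] / n for i in range(len(probs))]

# Key collision: the same key is reused, so the draw is fully deterministic.
collided = Counter(gumbel_sample(probs, 1234) for _ in range(n))
collided_freq = [collided[i] / n for i in range(len(probs))]

print("original distribution:", probs)
print("fresh keys           :", fresh_freq)
print("colliding key        :", collided_freq)
```

In this toy setting the colliding-key frequencies collapse onto a single token, which is the extreme form of the distribution bias described above; the permute-reweight family retains some randomness under a fixed key, so its bias would be milder than this sketch suggests.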

Paper Structure (25 sections, 11 theorems, 45 equations, 6 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Preliminary
Curse of Key Collision on Distortion-Free Watermarks
Existing Distortion-Free PDA-Rules
Non-Existence of Strongly Distortion-Free Watermarks under Key Collisions
Reducing Distribution Bias via Beta-Watermark
Experiments
Distortion-Free
Ablation Study
Conclusion
Algorithms of Beta-watermark
Related Work
Missing Proofs
Proof of Theorem 4.1
...and 10 more sections

Key Result

Theorem 4.1

Denote by $S(\cdot|\textup{sk})$ the test statistic. Under the null hypothesis $H_0$, given a random text $\bm{x}_{1:n}$, we have $\Pr(S(\bm{x}_{1:n}|\textup{sk}_0)-\mathbb{E}_{H_0}[S]\geq t|H_0)= p_0(t),$ i.e., $p_0(t)$ is the false positive rate of threshold $t$ under single secret key detection.

Figures (6)

Figure 1: Pseudo-randomness in a token sampling step for watermarked LMs. "Before" refers to the original LM token distribution and "After" refers to the watermarked token distribution. Given a fixed watermark key, both the inverse-sampling and Gumbel-reparametrization methods become deterministic. In contrast, the permute-reweight method retains some randomness.
Figure 2: Illustration of Beta PDA-rule.
Figure 3: Performance of different watermarks under one-time generation. Top: Violin plot of Text Summarization Perplexity. Bottom: Violin plot of Machine Translation BLEU. We can see the weakly distortion-free watermarks preserve the generation quality.
Figure 4: Left: Trade-off between distribution bias and watermark strength under key collision. The TPR is measured at 1% FPR. $\Delta$ Perplexity (distribution bias) increases with the TPR. Right: AUC score of different watermarks under varying attack strength $\epsilon$ on the text generation task.
Figure 5: ROC curve of TPR vs FPR.
...and 1 more figures

Theorems & Definitions (27)

Definition 3.1: PDA-rule
Definition 3.2: Distortion-free PDA-rule
Definition 3.3: Key collision
Theorem 4.1: Detection efficiency with multiple secret keys
Corollary 4.2
Definition 4.3: Step-wise distortion-free watermark
Definition 4.4: Weakly distortion-free watermark
Definition 4.5: Strongly distortion-free watermark
Theorem 4.6
Corollary 4.7
...and 17 more
