More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles

Ruibo Chen; Yihan Wu; Xuehao Cui; Jingqi Zhang; Heng Huang

TL;DR

The paper tackles watermark ensembles for detecting LLM-generated content and shows that maximizing single-layer watermark strength can unintentionally erode entropy and harm long-horizon detectability. It introduces $F_\lambda$, a general framework for weaker distortion-free watermarks that blends the watermarked and original distributions through a mixing parameter $\lambda$ to preserve entropy across layers. The authors provide theoretical results linking entropy to detectability and showing that both entropy and the expected green ratio decay monotonically across layers, and they show empirically, across multiple models and datasets, that weaker per-layer watermarks yield better multi-layer detectability and robustness. This entropy-preserving approach offers a practical path to more reliable, distortion-free watermark ensembles for long-form generation.
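The blending idea can be illustrated with a small Python sketch. It assumes the convex form $F_\lambda(P, k) = \lambda F(P, k) + (1-\lambda) P$, so that $\lambda = 1$ is the full-strength watermark and $\lambda = 0$ leaves $P$ untouched, and it uses a DiPmark-style $\alpha$-reweight keyed by a permutation of a four-token vocabulary as the base operator $F$. Both choices, the parameter values, and every function name are illustrative assumptions rather than the paper's code. Enumerating all 24 permutation keys checks that the blend stays distortion-free and that the key-averaged entropy grows as $\lambda$ shrinks.

```python
import itertools
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return sum(-x * math.log(x) for x in p if x > 0)


def dipmark_reweight(p, perm, alpha):
    """DiPmark-style alpha-reweight (illustrative): visit tokens in key order
    and remap cumulative mass c through g(c) = max(c - a, 0) + max(c - (1 - a), 0).
    Averaged over all permutations this returns p (for alpha <= 0.5)."""
    q = [0.0] * len(p)
    cdf, prev_g = 0.0, 0.0
    for tok in perm:
        cdf += p[tok]
        g = max(cdf - alpha, 0.0) + max(cdf - (1.0 - alpha), 0.0)
        q[tok] = g - prev_g
        prev_g = g
    return q


def f_lambda(p, perm, alpha, lam):
    """Hypothetical weaker operator: lam * watermarked + (1 - lam) * original."""
    q = dipmark_reweight(p, perm, alpha)
    return [lam * qi + (1.0 - lam) * pi for qi, pi in zip(q, p)]


def key_average(p, alpha, lam):
    """Average F_lambda and its entropy over every permutation key."""
    perms = list(itertools.permutations(range(len(p))))
    mean = [0.0] * len(p)
    mean_h = 0.0
    for perm in perms:
        q = f_lambda(p, perm, alpha, lam)
        mean = [m + qi / len(perms) for m, qi in zip(mean, q)]
        mean_h += entropy(q) / len(perms)
    return mean, mean_h


p = [0.4, 0.3, 0.2, 0.1]
for lam in (1.0, 0.75, 0.5, 0.25, 0.0):
    mean, mean_h = key_average(p, alpha=0.45, lam=lam)
    bias = max(abs(m - x) for m, x in zip(mean, p))
    print(f"lambda={lam:.2f}  max|E[F]-P|={bias:.1e}  E[H]={mean_h:.4f}")
```

Because the blend is linear, averaging over keys returns $P$ for every $\lambda$; the entropy gain at smaller $\lambda$ comes from the concavity of Shannon entropy.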

Abstract

Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
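The layer-wise entropy claim can be checked exactly on a toy vocabulary. The sketch below treats a watermark ensemble as applying the same reweighting operator repeatedly with an independent permutation key per layer, using a DiPmark-style $\alpha$-reweight blended with its input through $\lambda$. This model of an ensemble, the function names, and the parameters ($\alpha = 0.45$, four tokens, three layers) are illustrative assumptions, not the paper's experimental setup. Enumerating every key path gives the exact expected entropy after each layer.

```python
import itertools
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return sum(-x * math.log(x) for x in p if x > 0)


def dipmark_reweight(p, perm, alpha):
    """DiPmark-style alpha-reweight used here as an illustrative base watermark."""
    q = [0.0] * len(p)
    cdf, prev_g = 0.0, 0.0
    for tok in perm:
        cdf += p[tok]
        g = max(cdf - alpha, 0.0) + max(cdf - (1.0 - alpha), 0.0)
        q[tok] = g - prev_g
        prev_g = g
    return q


def layer(p, perm, alpha, lam):
    """One hypothetical ensemble layer: blend of reweighted and incoming distribution."""
    q = dipmark_reweight(p, perm, alpha)
    return [lam * qi + (1.0 - lam) * pi for qi, pi in zip(q, p)]


def ensemble_stats(p, n_layers, alpha, lam):
    """Exact expected entropy after each layer, plus the key-averaged final
    distribution, enumerating an independent permutation key at every layer."""
    perms = list(itertools.permutations(range(len(p))))
    paths = [(p, 1.0)]  # (current distribution, probability of this key path)
    ent = [entropy(p)]
    for _ in range(n_layers):
        paths = [(layer(d, perm, alpha, lam), w / len(perms))
                 for d, w in paths for perm in perms]
        ent.append(sum(w * entropy(d) for d, w in paths))
    final_mean = [sum(w * d[i] for d, w in paths) for i in range(len(p))]
    return ent, final_mean


p = [0.4, 0.3, 0.2, 0.1]
for lam in (1.0, 0.5):
    ent, _ = ensemble_stats(p, n_layers=3, alpha=0.45, lam=lam)
    print(f"lambda={lam:.1f}  E[H] per layer: " + ", ".join(f"{h:.4f}" for h in ent))
```

In this toy model the expected entropy never increases from one layer to the next, while the key-averaged output still equals the original distribution, so every layer remains distortion-free. Comparing the $\lambda = 1$ and $\lambda = 0.5$ rows shows how much entropy each setting leaves for later layers.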

Paper Structure (41 sections, 4 theorems, 34 equations, 6 figures, 2 tables)

Introduction
Related Work
Distortion-Free Watermark
Watermark Ensemble
Distortion-Free Watermark Ensemble
Generation
Notation.
Distortion-Free Watermark.
Watermark Ensemble.
Detection
Single-Layer Detection.
Multi-Layer Detection.
Rethinking Distortion-Free Ensemble Watermarks from an Entropy Perspective
Larger Entropy Leads to Better Detection Performance
Watermarks Reduce Entropy and Expected Green Ratio in Subsequent Layers
...and 26 more sections

Key Result

Theorem 4.1

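Let $F$ denote a distortion-free watermarking operator with private key $k \sim P_{\mathcal{K}}$, and let $P$ be the original next-token distribution. Then, in expectation over the watermark key, the Shannon entropy of the token distribution after watermarking does not increase:

$$\mathbb{E}_{k \sim P_{\mathcal{K}}}\left[H\big(F(P, k)\big)\right] \le H(P).$$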
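Writing $P$ for the original next-token distribution, a standard argument shows why the expected entropy cannot increase. It uses only the concavity of Shannon entropy and the distortion-free property $\mathbb{E}_{k}[F(P,k)] = P$; this is a sketch, and the paper's own proof may proceed differently.

```latex
\begin{aligned}
\mathbb{E}_{k \sim P_{\mathcal{K}}}\!\left[H\big(F(P,k)\big)\right]
  &\le H\!\left(\mathbb{E}_{k \sim P_{\mathcal{K}}}\!\left[F(P,k)\right]\right)
  && \text{(Jensen's inequality; } H \text{ is concave)} \\
  &= H(P)
  && \text{(distortion-freeness: } \mathbb{E}_{k}[F(P,k)] = P\text{)}
\end{aligned}
```

Applying the same bound at each layer with a fresh key, conditioned on the previous layer's output, gives a non-increasing expected entropy across the layers of an ensemble.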

Figures (6)

Figure 1: The relationship between entropy, watermark strength, and detectability in distortion-free watermark ensembles. Watermark detectability is closely tied to entropy. Stronger watermarks improve detectability within the current layer, but they significantly reduce the entropy of the token distribution. Weaker watermarks preserve more entropy, thereby enhancing detectability in subsequent layers. We argue that there is an inherent trade-off in detectability across layers, which can be controlled by adjusting watermark strength.
Figure 2: Watermark detectability as a function of text length on C4, MMW Story, and Longform QA using Llama-3.2-3B-Instruct. The detection threshold is set for a false positive rate of 0.01%. Weaker ensemble watermarks yield consistently higher detection performance across datasets.
Figure 3: Correlation between token-distribution entropy and expected green ratio. Results are computed on the MMW Story dataset with Llama-3.2-3B-Instruct using 2000 randomly sampled tokens, showing a strong positive association between entropy and expected green ratio.
Figure 4: Ablation of watermark strength on detection performance on the C4 dataset, with the false positive rate set to 0.01% and the token length set to 250. True positive rate (TPR) is reported for varying strength parameters $\alpha$ (ENS-DiPmark) and $\lambda$ (SynthID, ENS-MCMark) on Llama3-3B and Mistral-7B. Moderate weakening consistently yields the best detectability.
Figure 5: Average token entropy per layer under different watermark strengths on the C4 dataset using Llama-3.2-3B-Instruct with sequence length fixed to 150. Weaker watermark settings consistently preserve higher entropy across layers.
...and 1 more figure
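The positive association in Figure 3 can be reproduced in miniature. The sketch below defines the expected green ratio as the probability mass that a DiPmark-style $\alpha$-reweight places on green tokens, averaged over all permutation keys, with the green list taken to be the last half of each permutation (the end toward which this reweight shifts mass). This definition, the green-list convention, and the function names are assumptions for illustration and may differ from the paper's exact quantity.

```python
import itertools
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability tokens contribute nothing."""
    return sum(-x * math.log(x) for x in p if x > 0)


def dipmark_reweight(p, perm, alpha):
    """DiPmark-style alpha-reweight: remap cumulative mass in key order."""
    q = [0.0] * len(p)
    cdf, prev_g = 0.0, 0.0
    for tok in perm:
        cdf += p[tok]
        g = max(cdf - alpha, 0.0) + max(cdf - (1.0 - alpha), 0.0)
        q[tok] = g - prev_g
        prev_g = g
    return q


def expected_green_ratio(p, alpha, gamma=0.5):
    """Hypothetical metric: key-averaged watermarked mass on the green list,
    where the green list is the last gamma share of the permuted vocabulary."""
    perms = list(itertools.permutations(range(len(p))))
    n_green = int(round(gamma * len(p)))
    total = 0.0
    for perm in perms:
        q = dipmark_reweight(p, perm, alpha)
        green = perm[len(p) - n_green:]
        total += sum(q[t] for t in green)
    return total / len(perms)


examples = {
    "one-hot": [1.0, 0.0, 0.0, 0.0],
    "peaked": [0.7, 0.2, 0.05, 0.05],
    "uniform": [0.25, 0.25, 0.25, 0.25],
}
for name, p in examples.items():
    ratio = expected_green_ratio(p, alpha=0.45)
    print(f"{name:8s} H={entropy(p):.3f}  E[green]={ratio:.3f}")
```

A zero-entropy (one-hot) distribution cannot be reweighted at all, so its expected green ratio stays at the unwatermarked baseline of 0.5, while the uniform distribution reaches 0.95 under these settings.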

Theorems & Definitions (9)

Theorem 4.1: Entropy Decrease under Distortion-Free Watermarking
Proof
Theorem 4.2: Expected Green Ratio Decrease
Proof (sketch)
Theorem 4.3: Distortion-Freeness
Theorem 4.4: Entropy Preservation
Proof (sketch)
Proof
Proof
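Assuming $F_\lambda$ is the convex blend $F_\lambda(P,k) = \lambda F(P,k) + (1-\lambda)P$ with $\lambda \in [0,1]$ (the paper's exact parametrization may differ), the statements of Theorems 4.3 and 4.4 follow from linearity of expectation and concavity of $H$. This derivation is a sketch, not the paper's proof; the final inequality uses Theorem 4.1.

```latex
\begin{aligned}
\mathbb{E}_{k}\!\left[F_\lambda(P,k)\right]
  &= \lambda\,\mathbb{E}_{k}\!\left[F(P,k)\right] + (1-\lambda)\,P = P, \\[4pt]
\mathbb{E}_{k}\!\left[H\big(F_\lambda(P,k)\big)\right]
  &\ge \lambda\,\mathbb{E}_{k}\!\left[H\big(F(P,k)\big)\right] + (1-\lambda)\,H(P)
   \;\ge\; \mathbb{E}_{k}\!\left[H\big(F(P,k)\big)\right].
\end{aligned}
```

Under this assumption, a weaker blend remains distortion-free and, in expectation, retains at least as much entropy as the full-strength watermark, with equality at $\lambda = 1$.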