More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles

Ruibo Chen, Yihan Wu, Xuehao Cui, Jingqi Zhang, Heng Huang

TL;DR

The paper studies watermark ensembles for detecting LLM-generated content and shows that maximizing the strength of each single-layer watermark can erode token-distribution entropy, which in turn weakens detectability in later layers. It introduces a general framework of weaker distortion-free watermarks, $F_\lambda$, that blends the watermarked and original distributions through a mixing parameter $\lambda$ to preserve entropy across layers. The authors give theoretical results that link entropy to detectability and show that both entropy and the expected green-list ratio decrease monotonically across layers. Experiments across multiple models and datasets indicate that weaker per-layer watermarks yield better multi-layer detectability and robustness. This entropy-preserving approach offers a practical path to more reliable distortion-free watermark ensembles for long generated text.

Abstract

Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
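
As a rough numerical illustration of the mixing idea, the sketch below builds a toy key-dependent reweighting (a DiPmark-style $\alpha$-reweight, used here only as a stand-in for a strong single-layer watermark) and blends it with the original distribution. The mixture form $(1-\lambda)P + \lambda F(P,k)$, the direction of $\lambda$, the four-token distribution, and all function names are assumptions made for this example, not the paper's definitions. Enumerating every permutation key makes the expectations exact for this toy case.

```python
import itertools

import numpy as np


def entropy(p):
    """Shannon entropy in nats, treating 0 * log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())


def alpha_reweight(p, perm, alpha=0.5):
    """Toy strong watermark: DiPmark-style reweighting (assumed stand-in).

    `perm` plays the role of the private key. Tokens are ordered by `perm`
    and probability mass is shifted toward the end of that order.
    """
    p = np.asarray(p, dtype=float)
    order = list(perm)
    cum = np.concatenate(([0.0], np.cumsum(p[order])))
    shifted = np.maximum(cum - alpha, 0.0) + np.maximum(cum - (1.0 - alpha), 0.0)
    q = np.empty_like(p)
    q[order] = np.diff(shifted)
    return q


def mix(p, q, lam):
    """Assumed weaker operator: (1 - lam) * original + lam * watermarked."""
    return (1.0 - lam) * np.asarray(p, dtype=float) + lam * np.asarray(q, dtype=float)


# Hypothetical next-token distribution over a four-token vocabulary.
p = np.array([0.4, 0.3, 0.2, 0.1])
lam = 0.5

# Enumerate every key (permutation) so that averages are exact expectations.
keys = list(itertools.permutations(range(len(p))))
full = np.array([alpha_reweight(p, k) for k in keys])
weak = np.array([mix(p, q, lam) for q in full])

mean_full = full.mean(axis=0)  # key-averaged full-strength distribution
mean_weak = weak.mean(axis=0)  # key-averaged blended distribution

h_orig = entropy(p)
h_full = float(np.mean([entropy(q) for q in full]))
h_weak = float(np.mean([entropy(q) for q in weak]))

print("key-averaged full-strength distribution:", mean_full)
print("key-averaged blended distribution:      ", mean_weak)
print("entropy original / blended / full:", h_orig, h_weak, h_full)
```

Under these assumptions, both operators average back to the original distribution over keys, and the mean entropy of the blended operator lies between that of the full-strength reweighting and the original entropy.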

Paper Structure

This paper contains 41 sections, 4 theorems, 34 equations, 6 figures, and 2 tables.

Key Result

Theorem 4.1

Let $F$ denote a distortion-free watermarking operator with private key $k \sim P_{\mathcal{K}}$. Then, in expectation over the watermark key, the Shannon entropy of the token distribution after watermarking does not exceed the entropy of the original token distribution.
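
The displayed inequality for this theorem is not present in this extract. A minimal reconstruction consistent with the statement might read as follows, assuming the notation $P$ for the original next-token distribution, $F(P, k)$ for the watermarked distribution under key $k$, and $H$ for Shannon entropy. These symbols are chosen here for illustration and may differ from the paper's. The two steps are the standard Jensen argument, not necessarily the authors' proof.

```latex
\begin{align*}
\mathbb{E}_{k \sim P_{\mathcal{K}}}\big[ H\big(F(P, k)\big) \big]
  &\le H\Big( \mathbb{E}_{k \sim P_{\mathcal{K}}}\big[ F(P, k) \big] \Big)
  && \text{(Jensen's inequality, since $H$ is concave)} \\
  &= H(P)
  && \text{(distortion-freeness: } \mathbb{E}_{k}[F(P, k)] = P \text{)}
\end{align*}
```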

Figures (6)

  • Figure 1: The relationship between entropy, watermark strength, and detectability in distortion-free watermark ensembles. Watermark detectability is closely tied to entropy. Stronger watermarks improve detectability within the current layer. However, they significantly reduce the entropy of the token distribution. In contrast, weaker watermarks preserve more entropy, thereby enhancing detectability in subsequent layers. We propose that there exists an inherent trade-off between the detectability across layers, which can be effectively controlled by adjusting the strength of the watermark.
  • Figure 2: Watermark detectability as a function of text length on C4, MMW Story, and Longform QA using Llama-3.2-3B-Instruct. The threshold for false positive rate is set to 0.01%. Weaker ensemble watermarks yield consistently higher detection performance across datasets.
  • Figure 3: Correlation between token-distribution entropy and expected green ratio. Results are computed on the MMW Story dataset with Llama-3.2-3B-Instruct using 2000 randomly sampled tokens, showing a strong positive association between entropy and expected green ratio.
  • Figure 4: Ablation of watermark strength on detection performance on the C4 dataset with false positive rate set to 0.01% and token length set to 250. True positive rate (TPR) is reported for varying strength parameters $\alpha$ (ENS-DiPmark) and $\lambda$ (SynthID, ENS-MCMark) on Llama3-3B and Mistral-7B. Moderate weakening consistently yields the best detectability.
  • Figure 5: Average token entropy per layer under different watermark strengths on the C4 dataset using Llama-3.2-3B-Instruct with sequence length fixed to 150. Weaker watermark settings consistently preserve higher entropy across layers.
  • ...and 1 more figure
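
The captions above report detection at a fixed false positive rate of 0.01%. The detectors used for ENS-DiPmark, SynthID, and ENS-MCMark are not described in this extract, so the sketch below only illustrates the general idea with a generic green-token count test. It assumes that each token of unwatermarked text is green independently with probability 0.5, and the function names are hypothetical.

```python
from math import comb


def binomial_tail(n, g, p=0.5):
    """P(X >= g) for X ~ Binomial(n, p).

    Under the null hypothesis (unwatermarked text), this is the chance of
    seeing at least g green tokens among n scored tokens.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(g, n + 1))


def min_green_count(n, fpr=1e-4, p=0.5):
    """Smallest green count whose null tail probability is at most fpr."""
    for g in range(n + 1):
        if binomial_tail(n, g, p) <= fpr:
            return g
    return None  # no count reaches this false positive rate for such short text


def is_detected(n, g, fpr=1e-4, p=0.5):
    """Flag text as watermarked when its green count is significant at fpr."""
    return binomial_tail(n, g, p) <= fpr


# 250 tokens matches the length in the Figure 4 caption; 1e-4 is 0.01%.
threshold = min_green_count(250, fpr=1e-4)
print("green tokens needed out of 250:", threshold)
print("green ratio needed:", threshold / 250)
```

Read this way, a lower expected green ratio in later layers leaves less margin above such a threshold, which is one way to picture why entropy loss across layers can hurt detection at a strict false positive rate.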

Theorems & Definitions (9)

  • Theorem 4.1: Entropy Decrease under Distortion-Free Watermarking
  • proof
  • Theorem 4.2: Expected Green Ratio Decrease
  • proof: Proof sketch
  • Theorem 4.3: Distortion-Freeness
  • Theorem 4.4: Entropy Preservation
  • proof: Proof sketch
  • proof
  • proof
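
Only the titles of Theorems 4.3 and 4.4 appear in this listing. As a hedged sketch of why a blended operator could have both properties, suppose the weaker operator takes the form $F_\lambda(P, k) = (1-\lambda) P + \lambda F(P, k)$ with $\lambda \in [0, 1]$. This form and the symbols $P$, $H$ are assumptions for illustration, and the paper's exact statements may differ.

```latex
\begin{align*}
\mathbb{E}_{k}\big[ F_\lambda(P, k) \big]
  &= (1-\lambda) P + \lambda\, \mathbb{E}_{k}\big[ F(P, k) \big] = P
  && \text{(linearity, with $F$ distortion-free)} \\
H\big( F_\lambda(P, k) \big)
  &\ge (1-\lambda) H(P) + \lambda\, H\big( F(P, k) \big)
  && \text{(concavity of $H$)}
\end{align*}
```

Taking the expectation over $k$ in the second line and using $\mathbb{E}_{k}[H(F(P, k))] \le H(P)$ from Theorem 4.1 gives $\mathbb{E}_{k}[H(F_\lambda(P, k))] \ge \mathbb{E}_{k}[H(F(P, k))]$, so under this assumed form the blended operator keeps at least as much entropy on average as the full-strength one.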