More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles
Ruibo Chen, Yihan Wu, Xuehao Cui, Jingqi Zhang, Heng Huang
TL;DR
The paper studies watermark ensembles for detecting LLM-generated content and shows that maximizing the strength of each single-layer watermark erodes the entropy of the token distribution and harms long-horizon detectability. It introduces a general framework of weaker distortion-free watermarks, $F_\lambda$, that blends the watermarked and original distributions with a mixing parameter $\lambda$ to preserve entropy across layers. The authors give theoretical results that bound detectability by entropy and show that both entropy and the expected green-list ratio decrease monotonically across layers. Experiments on multiple models and datasets indicate that weaker per-layer watermarks yield better multi-layer detectability and robustness. This entropy-preserving approach offers a practical path to more reliable distortion-free watermark ensembles for long generations.
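The blending idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the convention that $\lambda$ weights the watermarked distribution (so $\lambda = 1$ is the full-strength watermark and $\lambda = 0$ leaves the distribution unchanged) is an assumption, and the toy watermark below simply collapses the distribution onto one keyed sample. That toy is distortion-free in expectation over keys, but it is not the green-list reweighting the paper analyzes.

```python
import math
import random


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def toy_watermark(p, rng):
    """Toy distortion-free watermark (an assumption, not the paper's method):
    collapse p onto a single token drawn from p with the keyed RNG.
    Averaged over keys, the result equals p."""
    idx = rng.choices(range(len(p)), weights=p, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(p))]


def blend(p, p_w, lam):
    """Mix original p and watermarked p_w; lam is assumed to weight p_w."""
    return [(1 - lam) * a + lam * b for a, b in zip(p, p_w)]


def mean_entropy_after_layers(p, lam, layers, trials, seed=0):
    """Average entropy left after stacking `layers` blended watermark layers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        q = list(p)
        for _ in range(layers):
            q = blend(q, toy_watermark(q, rng), lam)
        total += entropy(q)
    return total / trials


if __name__ == "__main__":
    p = [0.25, 0.25, 0.25, 0.25]
    for lam in (0.25, 0.5, 1.0):
        h = mean_entropy_after_layers(p, lam, layers=3, trials=2000)
        print(f"lambda={lam}: mean entropy after 3 layers = {h:.3f} nats")
```

In this toy setting, a full-strength layer leaves no entropy for the layers that follow, while smaller values of $\lambda$ keep some entropy available to each subsequent layer.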
Abstract
Watermarking has emerged as a crucial technique for detecting and attributing content generated by large language models. While recent advancements have utilized watermark ensembles to enhance robustness, prevailing methods typically prioritize maximizing the strength of the watermark at every individual layer. In this work, we identify a critical limitation in this "stronger-is-better" approach: strong watermarks significantly reduce the entropy of the token distribution, which paradoxically weakens the effectiveness of watermarking in subsequent layers. We theoretically and empirically show that detectability is bounded by entropy and that watermark ensembles induce a monotonic decrease in both entropy and the expected green-list ratio across layers. To address this inherent trade-off, we propose a general framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling. Empirical evaluations demonstrate that this counter-intuitive strategy mitigates signal decay and consistently outperforms strong baselines in both detectability and robustness.
