HeavyWater and SimplexWater: Distortion-Free LLM Watermarks for Low-Entropy Next-Token Predictions

Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Chen-Fu Chen, Haim Permuter, Sajani Vithana, Flavio P. Calmon

TL;DR

This work formulates a principled minimax framework for embedding watermarks in LLM outputs, aiming to maximize detectability while preserving text quality, especially in low-entropy generations. It introduces SimplexWater, a binary-score watermark tied to distance-maximizing codes and proven minimax-optimal, and HeavyWater, which uses heavy-tailed score distributions to boost detection; both are distortion-free via optimal transport and adaptable to any side-information scheme. The theoretical analysis connects watermarks to coding theory and shows that the Gumbel watermark is a special case within the optimal transport (OT) framework, while empirical results demonstrate superior detection with minimal distortion across coding and QA benchmarks and multiple open-weight models. The work also discusses randomness efficiency and practical considerations such as hashing schemes, robustness, and computational overhead, offering a path toward robust, verifiable watermarking for AI-generated text. Overall, HeavyWater and SimplexWater advance watermark design by exploiting tail behavior and coding-theory insights, enabling reliable authentication of machine-generated content with controlled textual impact.

Abstract

Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks -- such as coding -- where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off between detection accuracy and text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
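
To make the notion of side information concrete, the following is a minimal sketch of one way it could be produced by hashing previously generated tokens over a sliding window of size h. It is not taken from the paper or its repository: the function names, the use of SHA-256, the secret key, and the choice to draw a single integer in {0, ..., k-1} are all illustrative assumptions.

```python
import hashlib
import random


def side_info_seed(prev_tokens, secret_key, h=4):
    """Hash the last h token IDs with a secret key into a 64-bit seed.

    Hypothetical helper: the hash function and encoding are assumptions.
    """
    window = prev_tokens[-h:]  # sliding window over the most recent tokens
    payload = secret_key + b"|" + ",".join(map(str, window)).encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big")


def sample_side_info(prev_tokens, secret_key, k, h=4):
    """Draw side information s in {0, ..., k-1} from the seeded generator."""
    rng = random.Random(side_info_seed(prev_tokens, secret_key, h))
    return rng.randrange(k)


# The detector can recompute the same value from the text and the key alone.
s_generation = sample_side_info([11, 42, 7, 99, 3], b"demo-key", k=16)
s_detection = sample_side_info([11, 42, 7, 99, 3], b"demo-key", k=16)
```

Because the seed depends only on the key and the last h tokens, the generator and the detector obtain the same side information without sharing any other state, which is the property that hashing-based schemes rely on.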

Paper Structure

This paper contains 44 sections, 13 theorems, 115 equations, 15 figures, 4 tables, 3 algorithms.

Key Result

Proposition 1

Let $\lambda\in\left[\frac{1}{2},1\right)$. For $f\in\mathcal{F}_{\mathsf{bin}}$, define the vector $f_i=[f(i,1),\dots,f(i,k)]\!\in\!\{0,1\}^k$ for each $i\in\mathcal{X}$. Then, where $d_H(a,b)=\sum_{i=1}^k\mathbf{1}_{\{a_i\neq b_i\}}$ denotes the Hamming distance between $a,b\in\{0,1\}^k$ and $\mathbf{1}_{\{\cdot\}}$ is the indicator function.
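
As an illustration of the objects in this statement, the sketch below builds binary score vectors $f_i\in\{0,1\}^k$ and computes their pairwise Hamming distances $d_H$. It uses the textbook simplex code, $f(i,s)=\langle \mathrm{bin}(i),\mathrm{bin}(s)\rangle \bmod 2$ with $k=2^m-1$, as an assumed example of a distance-maximizing code; the paper's own definition of $\mathcal{F}_{\mathsf{bin}}$ and of its simplex code may differ in indexing or details, and the function names here are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Hamming distance d_H(a, b): number of positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def simplex_rows(m):
    """Score vectors f_i for i in {0, ..., 2^m - 1}, each of length k = 2^m - 1.

    Entry s of f_i is the parity of the bitwise AND of i and s, i.e. the
    inner product of their binary expansions modulo 2 (textbook simplex code).
    """
    k = 2**m - 1
    return [[bin(i & s).count("1") % 2 for s in range(1, k + 1)]
            for i in range(2**m)]


rows = simplex_rows(3)  # 8 tokens, k = 7 side-information values
distances = {hamming(a, b) for a, b in combinations(rows, 2)}
print(distances)  # every pair of distinct rows is at distance 2^(m-1) = 4
```

Every pair of distinct rows lies at the same Hamming distance, $2^{m-1}$, which is what makes the simplex code a natural candidate when the minimum pairwise distance is the quantity to maximize.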

Figures (15)

  • Figure 1: HeavyWater and SimplexWater demonstrate favorable detection performance (measured by p-values) with minimal distortion to the base unwatermarked model (measured by Cross-Entropy). See the numerical results section for details.
  • Figure 2: Visualization of the components of watermarking design.
  • Figure 3: Left: Tradeoff between detection (measured by $p$-value) and distortion (measured by Cross-Entropy) --- SimplexWater and HeavyWater achieve higher detection rates while preserving token distributions close to the base unwatermarked model. Right: Detection gained by employing our watermark under various randomness generation schemes and several sliding window sizes $h$. Both SimplexWater and HeavyWater provide a significant improvement of up to $250\%$ and a decrease in distortion.
  • Figure 4: Our watermarks require fewer tokens to reach a given detection strength (p-value) with zero distortion.
  • Figure C.5: Tail integrals of different score difference distributions: the higher the tail integral, the better the detection.
  • ...and 10 more figures

Theorems & Definitions (28)

  • Definition 1: Low-entropy distributions
  • Remark 1: Watermarking in Low-Entropy Regime
  • Proposition 1
  • Theorem 1: Maximum Detection Gap Upper Bound
  • Definition 1: Simplex Code
  • Theorem 2: SimplexWater Optimality
  • Theorem 3: Gumbel Watermark as OT
  • Theorem 4: Detection Gap
  • Proof: Proof of Prop. 1
  • Claim 1
  • ...and 18 more