WaterMax: breaking the LLM watermark detectability-robustness-quality trade-off

Eva Giboulot, Teddy Furon

TL;DR

WaterMax tackles the challenge of watermarking LLM outputs by reframing embedding as a chunk-based, multi-draft generation problem that preserves the model and sampling strategy. It introduces a robust, detector-centric design that operates on text chunks to achieve high detectability with minimal quality loss, and provides both theoretical models and extensive experiments demonstrating strong robustness against attacks. The approach outperforms state-of-the-art methods on a complete benchmark, while incurring higher computational cost that remains parallelizable. The results support practical deployment for traceability with near distortion-free outputs across diverse LLMs and entropy regimes, with future work toward distillation to reduce cost.

Abstract

Watermarking is a technical means to dissuade malfeasant usage of Large Language Models. This paper proposes a novel watermarking scheme, called WaterMax, that enjoys high detectability while sustaining the quality of the text generated by the original LLM. Its new design leaves the LLM untouched (no modification of the weights, logits, temperature, or sampling technique). WaterMax balances robustness and complexity, whereas the watermarking techniques of the literature inherently provoke a trade-off between quality and robustness. Its performance is both theoretically proven and experimentally validated. It outperforms all the SotA techniques under the most complete benchmark suite. Code available at https://github.com/eva-giboulot/WaterMax.

Paper Structure (54 sections, 4 theorems, 29 equations, 17 figures, 1 table, 1 algorithm)

Key Result

Proposition 3.1

The detectability, measured by the power of the test (i.e. the probability $P_{D}$ of detecting a watermarked text), is the following increasing function of $n$:
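
The closed-form expression is not reproduced in this summary. As a hedged illustration only, suppose the embedder generates $n$ independent drafts and keeps the one with the lowest p-value, and that each draft's p-value is uniform on $[0, 1]$ because the drafts are sampled without any watermark bias. Under these assumptions, the probability that the kept draft falls below a false-alarm threshold $P_{FA}$ is $1 - (1 - P_{FA})^n$, which increases with $n$. The sketch below compares this expression with a Monte Carlo estimate; the function names `theoretical_power` and `simulated_power` are hypothetical and do not come from the paper or its repository.

```python
import random


def theoretical_power(p_fa: float, n: int) -> float:
    """Probability that the smallest of n independent Uniform(0, 1)
    p-values is below p_fa (assumed model, not quoted from the paper)."""
    return 1.0 - (1.0 - p_fa) ** n


def simulated_power(p_fa: float, n: int, trials: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw one p-value per draft and keep the lowest one.
        best_p_value = min(rng.random() for _ in range(n))
        if best_p_value < p_fa:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    p_fa = 0.01  # illustrative threshold, much larger than the 1e-6 used in the figures
    for n in (1, 10, 100):
        print(n, theoretical_power(p_fa, n), simulated_power(p_fa, n))
```

With a threshold as small as $10^{-6}$, this single-text model needs a very large $n$ to reach high power, which is consistent with the summary's description of a chunk-based design that aggregates scores over several chunks.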

Figures (17)

  • Figure 1: Detectability as a function of text quality for different LLM architectures. WaterMax always reaches a detectability close to $1$ despite a negligible loss of quality. Probability of false-alarm fixed at $10^{-6}$, nucleus sampling ($top_p = 0.95$) at temperature $1.0$. Text quality is measured as the relative perplexity of the watermarked text over the non-watermarked text.
  • Figure 2: Theoretical power of the uniformly most powerful test (\ref{eq:power}) as a function of the number of generated texts $n$.
  • Figure 3: Theoretical power of the optimal (\ref{eq:LRT-agg}) and robust (\ref{eq:RobustDetector}) tests at $P_{FA} = 10^{-6}$ for $m=1$ as a function of the number of drafts $n$ per chunk and the number of chunks $N$.
  • Figure 4: Independence of token scores (a) and draft scores (b).
  • Figure 5: Detectability against quality of watermarking schemes using Llama-3-8b-Instruct with nucleus sampling ($top_p=0.95$) and hashing window $h=6$.
  • ...and 12 more figures

Theorems & Definitions (4)

  • Proposition 3.1
  • Proposition 4.1
  • Proposition 5.1
  • Proposition 5.2