A Unified Framework for LLM Watermarks

Thibaud Gloaguen; Robin Staab; Nikola Jovanović; Martin Vechev

TL;DR

This work introduces a principled constrained optimization framework for LLM watermarks, framing watermark design as maximizing the expected watermarked score $\mathbb{E}_G[G\cdot q(G)]$ under a distortion constraint $D(q(g), \mathbb{E}_G[q(G)], p)\le\varepsilon$ and thereby unifying disparate schemes under a single theory. By instantiating the constraint choice, the framework naturally recovers known schemes such as Red-Green, AAR/KTH, SynthID, and $\chi^2$, and also enables the design of novel, constraint-driven watermarks, including perplexity-based distance formulations. A key insight is the trade-off among quality, diversity, and watermark power, with hard constraints tending to preserve diversity and soft constraints enabling stronger power at the cost of diversity; perplexity-based constraints yield state-of-the-art detectability-quality performance. Extensive experiments across Llama3.1-8B and Ministral-3-14B validate the theoretical claims, showing Pareto-optimality with respect to the chosen constraint and illustrating how different constraint choices shape detectability and text quality. The framework therefore provides a practical, principled path to tailor watermarks for provenance, auditing, and abuse mitigation, while acknowledging limitations such as token-level optimization and non-joint detector design.

Abstract

LLM watermarks allow tracing AI-generated texts by inserting a detectable signal into their generated content. Recent works have proposed a wide range of watermarking algorithms, each with distinct designs, usually built using a bottom-up approach. Crucially, there is no general and principled formulation for LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and explicitly reveals the constraints that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework also provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to directly use perplexity as a proxy for quality, and derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates our framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.

Paper Structure (64 sections, 9 theorems, 93 equations, 8 figures, 1 table, 7 algorithms)

Introduction
This work:
Main contributions
Background and Related Work
LLM Watermarks
Watermark Sampling Mechanisms
Optimal Watermark Design
Our Framework
Setting
Optimization Objective
Adding Constraints
Hard and Soft Constraints
Penalized Formulation
Applying our Framework
Capturing Existing Watermarking Sampling Mechanisms
...and 49 more sections

Key Result

Theorem 4.1

Let $p\in\Delta(\Sigma)$ have full support, let $g\in\mathbb{G}$ be non-constant, and let $\varepsilon>0$. Consider […], where $\mathrm{KL}(q(g)\|p)=\sum_{u\in\Sigma} q(g)_u\log\frac{q(g)_u}{p_u}$ with the convention $0\log 0 = 0$. Define, for $\delta\ge 0$, […], and let […]. Then: […]
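
As a hedged sketch, and not the paper's own statement of Theorem 4.1: maximizing a linear score under a KL budget is classically solved by exponential tilting. The symbols $q_\delta$, $Z(\delta)$, $M$, and $\delta^\star$ below are notation assumed here for illustration.

```latex
\begin{align*}
&\max_{q\in\Delta(\Sigma)} \; \sum_{u\in\Sigma} g_u\, q_u
  \quad\text{s.t.}\quad \mathrm{KL}(q\,\|\,p)\le\varepsilon, \\
&q_\delta(g)_u = \frac{p_u\, e^{\delta g_u}}{Z(\delta)},
  \qquad Z(\delta)=\sum_{v\in\Sigma} p_v\, e^{\delta g_v},
  \qquad \delta\ge 0, \\
&\mathrm{KL}(q_\delta(g)\,\|\,p)
  = \delta \sum_{u\in\Sigma} g_u\, q_\delta(g)_u - \log Z(\delta), \\
&\lim_{\delta\to\infty}\mathrm{KL}(q_\delta(g)\,\|\,p)
  = -\log \sum_{u\in M} p_u,
  \qquad M=\arg\max_{u\in\Sigma} g_u .
\end{align*}
```

Because $g$ is non-constant and $p$ has full support, $\mathrm{KL}(q_\delta(g)\|p)$ is strictly increasing in $\delta$. Whenever $\varepsilon < -\log\sum_{u\in M} p_u$, there is therefore a unique $\delta^\star>0$ at which the constraint is tight, and $q_{\delta^\star}(g)$ attains the maximum. For larger $\varepsilon$, any distribution supported on $M$ that respects the budget (for example, $p$ renormalized on $M$) is optimal.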

Figures (8)

Figure 1: Overview of Our Framework. We find that most prior watermarks can be viewed from the following angle (left): given a next-token probability distribution $p$ and token scores $g$, pseudorandomly sampled from $G$, they compute a watermarked probability distribution $q(g)$. A text is considered watermarked if the sum of pseudorandom scores is above a given threshold. Therefore, we frame watermarking as a constrained optimization problem (middle): maximizing the expected score while controlling the watermark distortion. In particular, the constraint balances the watermark quality and diversity. This formulation captures most existing prior watermarks, and enables designing new optimal schemes with respect to a given constraint (right).
Figure 2: Comparison of the Detectability–Constraint Trade-off. We compare the trade-off between watermark detectability (TPR@1) and different constraints (KL divergence (left), $\chi^2$ distance (middle), and soft PPL (right)). We find that, for each constraint, the corresponding scheme has the best detectability-constraint trade-off. Responses are $200$-token-long replies by Llama3.1-8B with temperature $0.7$ and $1000$ prompts from ELI5. Above each subplot, we highlight the scheme derived from that subplot's constraint.
Figure 3: Comparison of the Detectability–Quality Trade-off. We compare the trade-off between watermark detectability (TPR@1) and text quality (log PPL) for different constraint instantiations and for different $\varepsilon$. The left figure shows the hard constraints and the right one shows the soft constraints. The dashed line corresponds to the log PPL of the unwatermarked replies. Responses are 200-token-long replies by Llama3.1-8B with temperature $0.7$ and $1000$ prompts from ELI5.
Figure 4: Diversity–Quality Trade-off at Fixed TPR. We compare the trade-off between watermark impact on diversity (Self-BLEU) and quality ($\log$ PPL) given a fixed TPR@1 of $0.95$. A high Self-BLEU score corresponds to low output diversity. The error bars correspond to the standard deviation and the dotted gray line shows the correlation. The PPL and TPR@1 are measured over $1000$ prompts from ELI5 (as in the detectability evaluation), while the diversity is measured with the procedure described in the diversity evaluation section. We find that, given a fixed power, diversity and quality are negatively correlated.
Figure 5: Additional Detectability–Constraint Trade-off. We compare the trade-off between watermark detectability (TPR@1) and different constraints (KL divergence between $\mathbb{E}_G[q(G)]$ and $p$ (left), and hard PPL (right)). Responses are 200-token-long replies by Llama3.1-8B with temperature $0.7$ and $1000$ prompts from ELI5.
...and 3 more figures
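
The pipeline in the Figure 1 caption (scores $g$, watermarked distribution $q(g)$, detection by summed scores) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a KL constraint on $q(g)$, binary scores, a fixed toy next-token distribution instead of an LLM, and a context made of only the previous token. The names `scores_for`, `watermark_distribution`, `generate`, and `detection_score` are hypothetical.

```python
import numpy as np

VOCAB = 50  # toy vocabulary size (assumption, no real tokenizer involved)

def scores_for(prev_token, key):
    """Hypothetical pseudorandom scores g in {0,1}^VOCAB, seeded by key and context."""
    rng = np.random.default_rng([key, prev_token])
    g = np.zeros(VOCAB)
    g[rng.permutation(VOCAB)[: VOCAB // 2]] = 1.0  # half the tokens score 1
    return g

def tilt(p, g, delta):
    """Exponentially tilted distribution q_u proportional to p_u * exp(delta * g_u)."""
    logits = np.log(p) + delta * g
    q = np.exp(logits - logits.max())  # subtract max for numerical stability
    return q / q.sum()

def kl(q, p):
    """KL(q || p) with the convention 0 log 0 = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def watermark_distribution(p, g, eps, delta_max=1e3, iters=60):
    """Approximately maximize g.q subject to KL(q || p) <= eps via bisection on delta."""
    if kl(tilt(p, g, delta_max), p) <= eps:
        return tilt(p, g, delta_max)  # budget allows (almost) all mass on top scores
    lo, hi = 0.0, delta_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(tilt(p, g, mid), p) <= eps:
            lo = mid  # still feasible, push delta up
        else:
            hi = mid
    return tilt(p, g, lo)  # lo is always feasible

def generate(p, n_tokens, key, eps, rng):
    """Sample tokens from the toy distribution p; eps <= 0 disables the watermark."""
    tokens, prev = [], 0
    for _ in range(n_tokens):
        g = scores_for(prev, key)
        q = watermark_distribution(p, g, eps) if eps > 0 else p
        prev = int(rng.choice(VOCAB, p=q))
        tokens.append(prev)
    return tokens

def detection_score(tokens, key):
    """Sum of the pseudorandom scores of the observed tokens."""
    prev, total = 0, 0.0
    for t in tokens:
        total += scores_for(prev, key)[t]
        prev = t
    return total
```

A text would then be flagged when `detection_score` exceeds a threshold calibrated on unwatermarked text. In this toy setup, an unwatermarked sequence scores about half its length on average, while a watermarked one scores noticeably higher as `eps` grows.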

Theorems & Definitions (18)

Theorem 4.1
proof
Theorem 4.2
proof
Theorem 4.3
proof
Theorem 4.4
proof
Theorem 4.5
proof
...and 8 more
