A Unified Framework for LLM Watermarks
Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev
TL;DR
This work introduces a principled constrained optimization framework for LLM watermarks. Watermark design is framed as maximizing the expected watermarked score $\mathbb{E}_G[G\cdot q(G)]$ subject to a distortion constraint $D(q(g), \mathbb{E}_G[q(G)], p)\le\varepsilon$, which unifies disparate schemes under a single theory. Different choices of the constraint recover known schemes such as Red-Green, AAR/KTH, SynthID, and $\chi^2$, and also yield novel constraint-driven watermarks, including ones whose distortion measure is based on perplexity. A key insight is the trade-off among quality, diversity, and watermark power: hard constraints tend to preserve diversity, whereas soft constraints enable stronger power at the cost of diversity. Perplexity-based constraints yield state-of-the-art detectability-quality performance. Experiments on Llama3.1-8B and Ministral-3-14B validate the theoretical claims: schemes are Pareto-optimal with respect to the constraint they are derived from, and the choice of constraint shapes both detectability and text quality. The framework thus provides a practical, principled path to tailoring watermarks for provenance, auditing, and abuse mitigation. Acknowledged limitations include token-level optimization and non-joint detector design.
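To make the objective concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes one reading of the notation: $p$ is the model's next-token distribution, $G$ is a vector of key-dependent per-token scores, and $q(G)$ is the watermarked distribution. For illustration it uses random binary "green" scores and takes the distortion to be $\mathrm{KL}(q(g)\,\|\,p)$, for which exponential tilting $q_i \propto p_i e^{\delta g_i}$ is the standard maximizer of $g\cdot q$ at a given KL budget. The function names, the Bernoulli(1/2) score sampling, and the choice of KL are all assumptions made here and do not come from the source.

```python
import math
import random


def tilt(p, g, delta):
    """Exponentially tilt p toward high-score tokens: q_i proportional to p_i * exp(delta * g_i)."""
    weights = [p_i * math.exp(delta * g_i) for p_i, g_i in zip(p, g)]
    total = sum(weights)
    return [w / total for w in weights]


def score(q, g):
    """Watermark score g . q: the probability mass q places on high-score tokens."""
    return sum(q_i * g_i for q_i, g_i in zip(q, g))


def kl(q, p):
    """KL(q || p), used here as an illustrative distortion measure (an assumption)."""
    return sum(q_i * math.log(q_i / p_i) for q_i, p_i in zip(q, p) if q_i > 0)


def simulate(p, delta, n_keys=2000, seed=0):
    """Monte Carlo estimates of E_G[G . q(G)], E_G[KL(q(G) || p)], and E_G[q(G)]."""
    rng = random.Random(seed)
    mean_score, mean_kl = 0.0, 0.0
    mean_q = [0.0] * len(p)
    for _ in range(n_keys):
        # Hypothetical key-dependent scores: each token is "green" with prob. 1/2.
        g = [rng.randint(0, 1) for _ in p]
        q = tilt(p, g, delta)
        mean_score += score(q, g) / n_keys
        mean_kl += kl(q, p) / n_keys
        mean_q = [m + q_i / n_keys for m, q_i in zip(mean_q, q)]
    return mean_score, mean_kl, mean_q


if __name__ == "__main__":
    p = [0.5, 0.3, 0.15, 0.05]  # toy next-token distribution
    for delta in (0.0, 1.0, 2.0, 4.0):
        s, d, mq = simulate(p, delta)
        print(f"delta={delta}: score={s:.3f}  KL={d:.3f}  E[q]={[round(x, 3) for x in mq]}")
```

Under these assumptions, increasing `delta` raises both the expected score and the expected KL distortion, which is the power-versus-distortion tension the framework formalizes. Printing the averaged distribution $\mathbb{E}_G[q(G)]$ next to $p$ also shows how far the key-averaged output drifts from the original model.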
Abstract
LLM watermarks allow AI-generated text to be traced by embedding a detectable signal in the generated content. Recent works have proposed a wide range of watermarking algorithms, each with distinct designs, usually built using a bottom-up approach. Crucially, there is no general and principled formulation for LLM watermarking. In this work, we show that most existing and widely used watermarking schemes can in fact be derived from a principled constrained optimization problem. Our formulation unifies existing watermarking methods and explicitly reveals the constraints that each method optimizes. In particular, it highlights an understudied quality-diversity-power trade-off. At the same time, our framework also provides a principled approach for designing novel watermarking schemes tailored to specific requirements. For instance, it allows us to directly use perplexity as a proxy for quality, and derive new schemes that are optimal with respect to this constraint. Our experimental evaluation validates our framework: watermarking schemes derived from a given constraint consistently maximize detection power with respect to that constraint.
