Optimizing watermarks for large language models
Bram Wouters
TL;DR
The paper addresses the problem of distinguishing LLM-generated text from human-written text while preserving text quality, by casting watermark design as a multi-objective optimization problem over watermarks based on a green–red split of the vocabulary. Within this class, which contains the KGW watermark, it identifies the Pareto-optimal watermarks (the proposed OPT watermark): those that maximize the expected number of green-list tokens while minimizing the impact on perplexity. The two objectives are formalized through the metrics $\Delta N_g$ and $\Delta \log \text{PPL}$, and the optimal watermark is governed by a thresholded rule based on $B(p_t,\mathcal{G}_t)$. Experiments on prompts from the C4 dataset with the OPT-350m language model show that the OPT watermark outperforms the KGW watermark in the trade-off between test quality and text quality, though strong watermarking can induce dependencies that violate the binomial assumptions and affect Pareto optimality. The work contributes a principled, systematic method for trading off identifiability against text quality for a class of robust, efficient LLM watermarks, states its modeling assumptions, and points to directions for extending the framework, giving practitioners a controllable trade-off between detectability and text fidelity.
Abstract
With the rise of large language models (LLMs) and concerns about potential misuse, watermarks for generative LLMs have recently attracted much attention. An important aspect of such watermarks is the trade-off between their identifiability and their impact on the quality of the generated text. This paper introduces a systematic approach to this trade-off in terms of a multi-objective optimization problem. For a large class of robust, efficient watermarks, the associated Pareto optimal solutions are identified and shown to outperform the currently default watermark.
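To make the trade-off concrete, the following is a minimal sketch for a single next-token distribution. It assumes a KGW-style watermark that adds a bias `delta` to the logits of green-list tokens, and it uses per-token proxies for the two objectives: the gain in green-list probability mass, and the increase in expected negative log-likelihood under the unwatermarked model. The function names, the toy distribution, and the fixed green list are hypothetical, and the paper's exact definitions of $\Delta N_g$ and $\Delta \log \text{PPL}$ may differ. The sketch illustrates the baseline watermark only, not the Pareto-optimal one.

```python
import math


def kgw_watermark(probs, green, delta):
    """KGW-style shift: adding delta to green logits is equivalent to
    multiplying green probabilities by exp(delta) and renormalizing."""
    weights = [p * math.exp(delta) if i in green else p
               for i, p in enumerate(probs)]
    total = sum(weights)
    return [w / total for w in weights]


def green_mass(probs, green):
    """Probability that the sampled token is on the green list."""
    return sum(probs[i] for i in green)


def tradeoff(probs, green, delta):
    """Return (gain, cost) for one token position.

    gain: increase in green-list probability mass (proxy for Delta N_g).
    cost: increase in expected negative log-likelihood under the original
          model (proxy for Delta log PPL). Assumes all probs are positive.
    """
    q = kgw_watermark(probs, green, delta)
    gain = green_mass(q, green) - green_mass(probs, green)
    nll_original = -sum(p * math.log(p) for p in probs)
    nll_watermarked = -sum(qi * math.log(pi) for qi, pi in zip(q, probs))
    return gain, nll_watermarked - nll_original


if __name__ == "__main__":
    # Hypothetical toy distribution over a 4-token vocabulary.
    probs = [0.5, 0.3, 0.15, 0.05]
    green = {1, 3}  # hypothetical green list for this position
    for delta in (0.0, 1.0, 2.0):
        gain, cost = tradeoff(probs, green, delta)
        print(f"delta={delta:.1f}  gain={gain:.3f}  cost={cost:.3f}")
```

On this toy distribution, raising `delta` increases both the gain and the cost. The multi-objective view asks which watermarks achieve the largest gain for a given cost.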
