Optimizing watermarks for large language models

Bram Wouters

TL;DR

The paper addresses the problem of distinguishing LLM-generated text from human-written text while preserving text quality, by casting watermark design as a multi-objective optimization problem over watermarks that split the vocabulary into a green list and a red list. Within this class, which includes the KGW watermark, it identifies the Pareto-optimal watermarks (the proposed OPT watermark), which maximize the expected number of green-list tokens while minimizing the impact on perplexity. The two objectives are formalized through the metrics $\Delta N_g$ and $\Delta \log \text{PPL}$, and the optimal watermark is governed by a thresholded rule based on $B(p_t,\mathcal{G}_t)$. Empirical results on prompts from the C4 dataset with the OPT-350m language model show that the OPT watermark outperforms the earlier KGW watermark in the trade-off between test quality and text quality, although strong watermarking can induce dependencies that deviate from the binomial assumptions and affect Pareto optimality. The work contributes a systematic method for trading off identifiability against text quality within a large class of robust, efficient LLM watermarks, and it highlights its modeling assumptions and directions for extending the framework. The findings are relevant to deploying verifiable watermarks in LLM-based systems with a controllable trade-off between detectability and text fidelity.

Abstract

With the rise of large language models (LLMs) and concerns about potential misuse, watermarks for generative LLMs have recently attracted much attention. An important aspect of such watermarks is the trade-off between their identifiability and their impact on the quality of the generated text. This paper introduces a systematic approach to this trade-off in terms of a multi-objective optimization problem. For a large class of robust, efficient watermarks, the associated Pareto optimal solutions are identified and shown to outperform the currently default watermark.

Paper Structure

This paper contains 17 sections, 28 equations, and 10 figures.

Figures (10)

  • Figure 1: Test quality, measured as the expected number of green-list tokens, versus text quality, measured as the expected log-perplexity, is shown for different watermarks. For completeness, the original language model without watermark is included (LLM). Also shown is the Pareto optimal bound. Error bars (vertical and horizontal) are omitted, as they are never larger than the marker sizes.
  • Figure 2: Test quality, measured as the power, versus text quality, measured as the expected log-perplexity, is shown for different tests ($n^*=12,15,18$) and watermarks. For completeness, the original language model without watermark is included (LLM). Also shown is the Pareto optimal bound. Error bars (vertical and horizontal) are omitted, as they are never larger than the marker sizes.
  • Figure 3: The $q$th percentile of $-\log \text{P}\left[ V_t \,\middle|\, V_{:t} \right]$ is shown for different watermarks and $q = 0.01, 0.1, 0.5, 0.9$, and $0.99$.
  • Figure 4: Pareto optimal bounds for different values of the hyperparameter $\gamma$, for tests with different false-positive rates $\alpha^*$. It shows that there is no universally "best" $\gamma$.
  • Figure 5: The power of OPT watermarks without a change in expected log-perplexity ($\tilde{\text{E}}\left[\log \text{PPL}\right] = \text{E}\left[\log \text{PPL}\right]$), as a function of the hyperparameter $\gamma$, for tests with different false-positive rates $\alpha^*$. The "best" value for $\gamma$ usually lies between 0.1 and 0.2.
  • ...and 5 more figures
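
The tests behind Figures 2, 4, and 5 are characterized by a green-token threshold $n^*$ and a false-positive rate $\alpha^*$. As a rough illustration of how the two might relate, the Python sketch below assumes that, for text written without knowledge of the watermark, each scored token falls in the green list independently with probability $\gamma$, so the green-token count is binomial. That null model, the text length `n_tokens`, the example values, and the function names are assumptions made for this sketch and are not taken from the paper.

```python
from math import comb


def binomial_tail(n_tokens: int, gamma: float, n_star: int) -> float:
    """Return P[N_g >= n_star] for N_g ~ Binomial(n_tokens, gamma).

    Under the assumed null model this is the false-positive rate of a test
    that flags a text as watermarked when it has at least n_star green tokens.
    """
    return sum(
        comb(n_tokens, k) * gamma**k * (1.0 - gamma) ** (n_tokens - k)
        for k in range(n_star, n_tokens + 1)
    )


def smallest_threshold(n_tokens: int, gamma: float, alpha: float) -> int:
    """Return the smallest n_star whose false-positive rate is at most alpha.

    A return value of n_tokens + 1 means that no attainable green-token
    count meets the requested rate, so the test would never reject.
    """
    for n_star in range(n_tokens + 2):
        if binomial_tail(n_tokens, gamma, n_star) <= alpha:
            return n_star
    return n_tokens + 1  # unreachable for alpha >= 0; kept for clarity


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    n_tokens, gamma = 30, 0.25
    for n_star in (12, 15, 18):
        rate = binomial_tail(n_tokens, gamma, n_star)
        print(f"n*={n_star}: false-positive rate = {rate:.3e}")
```

Raising the threshold lowers the false-positive rate but also makes a watermarked text harder to detect, which is why the figures compare watermarks at a fixed test. Since the summary notes that strong watermarking can induce dependencies that deviate from the binomial assumptions, this sketch covers only the false-positive side of the test and says nothing about its power.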