A Watermark for Black-Box Language Models

Dara Bahri, John Wieting

TL;DR

This work presents a distortion-free black-box watermarking framework for language models that requires only sample access and supports recursive chaining via multiple secret keys. It encodes watermarks by scoring candidate sequences with a PRF derived from a secret key and deduplicated n-grams, enabling statistically testable detection using p-values and Fisher-style combinations across keys. Theoretical results establish distortion-free properties and ROC-AUC guarantees, while experiments on large LLMs (e.g., Mistral-7B-instruct, Gemma-7B-instruct) show strong detection performance that can rival white-box approaches, though robustness to paraphrasing and adversarial perturbations remains a challenge. Overall, the approach offers a practical, provably detectable watermarking mechanism for third-party users of LLM APIs with quantified performance and attack-resilience trade-offs.
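
The scoring idea summarized above can be illustrated with a minimal sketch. It assumes candidates are already sampled from the model as lists of token ids, uses HMAC-SHA256 as a stand-in PRF, and picks the candidate with the highest mean PRF value over its deduplicated n-grams. All function names, the choice of `n=2`, and the selection and test statistics are illustrative assumptions; in particular, this simplified selection rule omits whatever handling of duplicate candidates the paper uses, so it is not claimed to be distortion-free.

```python
import hashlib
import hmac
import math


def prf_uniform(key: bytes, ngram: tuple) -> float:
    """Map (key, n-gram) to a pseudorandom float strictly inside (0, 1)."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    bits = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (bits + 0.5) / 2**52


def unique_ngrams(tokens, n=2):
    """Return the n-grams of `tokens` in order, dropping repeats."""
    seen, out = set(), []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram not in seen:
            seen.add(gram)
            out.append(gram)
    return out


def score(tokens, key, n=2):
    """Mean PRF value over deduplicated n-grams (assumed scoring rule)."""
    grams = unique_ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(prf_uniform(key, g) for g in grams) / len(grams)


def pick_watermarked(candidates, key, n=2):
    """Among sampled candidates, return the one with the highest score."""
    return max(candidates, key=lambda c: score(c, key, n))


def detect_p_value(tokens, key, n=2):
    """P-value under the null that the PRF values are i.i.d. uniform.

    Each -log(1 - u) is then Exp(1), so the sum over k unique n-grams is
    Gamma(k, 1), whose survival function is exp(-s) * sum_{i<k} s^i / i!.
    """
    grams = unique_ngrams(tokens, n)
    k = len(grams)
    if k == 0:
        return 1.0
    s = sum(-math.log(1.0 - prf_uniform(key, g)) for g in grams)
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= s / i
        total += term
    return min(1.0, math.exp(-s) * total)
```

A caller would sample several candidate continuations through the API, pass them to `pick_watermarked`, and later run `detect_p_value` on suspect text with the same key; small p-values suggest the text was selected under that key.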

Abstract

Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
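
When the scheme is chained across multiple secret keys, each key yields its own detection p-value, and these can be merged into one. The sketch below shows the classical Fisher method, which assumes the per-key p-values are independent and uniform under the null; the paper's exact "Fisher-style" rule is not reproduced in this summary, so treat this as an assumption. The function names are hypothetical.

```python
import math


def chi2_sf_even(x: float, dof: int) -> float:
    """Survival function of a chi-square variable with even `dof`.

    For dof = 2k the closed form is exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def fisher_combine(p_values):
    """Combine independent p-values: -2 * sum(log p) ~ chi-square(2k)."""
    if not p_values:
        raise ValueError("need at least one p-value")
    # Clamp to avoid log(0) for numerically vanishing p-values.
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    return chi2_sf_even(stat, 2 * len(p_values))
```

With a single key the combined value equals the input p-value, and several moderately small p-values combine into a smaller one, which is the behavior one wants when evidence accumulates across keys.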

Paper Structure

This paper contains 25 sections, 6 theorems, 21 equations, 7 figures, 7 tables.

Key Result

Theorem 4.1

Let $X$ be any finite sequence and $P$ any prompt. Let $X_u \sim \texttt{LM}\left(\;\cdot \mid P\right)$ be the non-watermarked output of the conditional autoregressive language model. Let $X_w$ be the output of the watermarking procedure (Watermark in Algorithm \ref{algo:flat}, for both recursive and non

Figures (7)

  • Figure 1: Performance of our flat scheme. Top: Detection AUC and pAUC with 1% max FPR for a range of target text lengths when there is no corruption. Bottom Left: AUC (mixed $T$'s) as a function of the average non-watermarked response entropy of the examples used in the calculation. An $x$-coordinate of $x$ corresponds to the bucket of examples whose entropy lies in $[x-0.25, x]$ nats. Bottom Right: Effect of the amount of random token corruption on AUC (mixed $T$'s).
  • Figure 2: Effect of the amount of (random token replacement) corruption on detection pAUC (flat scheme; mixed $T$'s) with 1% max FPR.
  • Figure 3: Left: Histogram of the average entropy (nats) in the LLM's underlying next-token distribution across non-watermarked response tokens. Right: A lower bound for ROC-AUC predicted by Theorem \ref{thm:rocauc_unif} as a function of the entropy term $\alpha$ for the range of values we observe empirically. When $m$ is large, $\alpha$ becomes a reasonable estimator of the LLM's entropy.
  • Figure 4: Left: A lower bound for ROC-AUC predicted by Theorem \ref{thm:rocauc_unif} as a function of $m$ (using optimal $\alpha = \log(m)$). Right: Same plot, but as a function of $T$ (again, using optimal $\alpha$).
  • Figure 5: Given a distribution over the vocabulary (taken to be of size 32k), we can estimate $\alpha$ for finite $m$ via simulation (1000 trials). We observe that when the underlying next-token distribution is uniform, $\alpha \approx \log(m)$ in a practical range for $m$. However, when the underlying distribution is Zipf (less entropy), $\alpha$ quickly deviates from $\log(m)$ as $m$ grows and the probability of sampling duplicate tokens increases.
  • ...and 2 more figures
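
The simulation described in the Figure 5 caption can be sketched roughly as follows. The definition of $\alpha$ is not given in this summary, so the code assumes it is the expected empirical (plug-in) entropy, in nats, of $m$ i.i.d. draws from the next-token distribution; this assumption is consistent with $\alpha \approx \log(m)$ when all draws are distinct, but it may not match the paper's definition. The function names and the Zipf exponent of 1 are likewise illustrative.

```python
import math
import random
from collections import Counter
from itertools import accumulate


def empirical_entropy(samples):
    """Plug-in entropy (nats) of the empirical distribution of `samples`."""
    m = len(samples)
    counts = Counter(samples)
    return -sum((c / m) * math.log(c / m) for c in counts.values())


def estimate_alpha(weights, m, trials=1000, seed=0):
    """Average empirical entropy of m draws over `trials` simulations.

    `weights` are unnormalized token probabilities (assumed input format).
    """
    rng = random.Random(seed)
    cum = list(accumulate(weights))
    vocab = range(len(weights))
    total = 0.0
    for _ in range(trials):
        draws = rng.choices(vocab, cum_weights=cum, k=m)
        total += empirical_entropy(draws)
    return total / trials


VOCAB_SIZE = 32_000  # vocabulary size quoted in the caption
uniform_weights = [1.0] * VOCAB_SIZE
zipf_weights = [1.0 / rank for rank in range(1, VOCAB_SIZE + 1)]
```

Calling `estimate_alpha(uniform_weights, m)` and `estimate_alpha(zipf_weights, m)` for increasing `m` and comparing each against `math.log(m)` shows the qualitative effect the caption describes: duplicates are rare under the uniform distribution but frequent under the Zipf one, which pulls the estimate below $\log(m)$.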

Theorems & Definitions (14)

  • Theorem 4.1: Distortion-free property
  • Theorem 4.2: Lower bound on detection ROC-AUC
  • Theorem 4.3: False positive rate
  • Theorem 4.4: Optimal detection for Gamma
  • Remark
  • Lemma A.1
  • proof
  • Remark
  • proof: Proof of Theorem \ref{thm:distortion}
  • proof: Proof of Theorem \ref{thm:fpr_general}
  • ...and 4 more