A Watermark for Black-Box Language Models

Dara Bahri; John Wieting

TL;DR

This work presents a distortion-free black-box watermarking framework for language models that requires only sample access and supports recursive chaining via multiple secret keys. It embeds the watermark by scoring candidate sequences with a pseudorandom function (PRF) seeded by a secret key and deduplicated n-grams, which enables statistical detection via p-values and Fisher-style combinations across keys. Theoretical results establish the distortion-free property and ROC-AUC guarantees, while experiments on LLMs (e.g., Mistral-7B-instruct, Gemma-7B-instruct) show strong detection performance that can rival white-box approaches, though robustness to paraphrasing and adversarial perturbations remains a challenge. Overall, the approach offers a practical, provably detectable watermarking mechanism for third-party users of LLM APIs, with quantified performance and attack-resilience trade-offs.
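To make the multi-key detection step more concrete, the following is a minimal sketch of the textbook form of Fisher's method for combining per-key p-values. It assumes the per-key p-values are independent and uniform under the null hypothesis (non-watermarked text). The function name `fisher_combine` is hypothetical, and the paper's exact combination rule may differ from this standard form.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p_i) follows a chi-squared distribution
    with 2k degrees of freedom. For even degrees of freedom the survival
    function has a closed form, so no external library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    # h is half of Fisher's statistic: h = -sum(ln p_i).
    h = -sum(math.log(p) for p in p_values)

    # Survival function of chi-squared(2k) at 2h:
    #   exp(-h) * sum_{i=0}^{k-1} h^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= h / i
        total += term
    return math.exp(-h) * total


# Hypothetical usage: one p-value per secret key in the chain.
combined = fisher_combine([0.03, 0.10, 0.02])
```

With a single key the combined value reduces to the original p-value, which is a quick sanity check on the closed form.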

Abstract

Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e. black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
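As an illustration of what sample-only access permits, here is a minimal sketch of best-of-m selection: draw several candidates from a black-box sampler, score each with a keyed pseudorandom function over its deduplicated n-grams (the ingredients named in the TL;DR), and keep the highest-scoring candidate. Several things here are assumptions rather than the paper's method: HMAC-SHA256 stands in for the PRF, the score is a plain mean, and all names (`prf_uniform`, `unique_ngrams`, `score`, `watermark_sample`) are hypothetical. In particular, this simplified rule carries no distortion-free guarantee; it only shows the shape of the interface.

```python
import hashlib
import hmac
import random


def prf_uniform(key: bytes, ngram: tuple) -> float:
    """Map (key, n-gram) to a pseudorandom value in [0, 1) via HMAC-SHA256."""
    msg = ",".join(str(t) for t in ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def unique_ngrams(tokens, n):
    """Return the n-grams of `tokens` with duplicates removed, order kept."""
    seen, out = set(), []
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram not in seen:
            seen.add(gram)
            out.append(gram)
    return out


def score(key, tokens, n=2):
    """Mean PRF value over deduplicated n-grams (illustrative statistic)."""
    grams = unique_ngrams(tokens, n)
    if not grams:
        return 0.0
    return sum(prf_uniform(key, g) for g in grams) / len(grams)


def watermark_sample(sample_fn, key, m=8, n=2):
    """Draw m candidates from a black-box sampler and keep the top-scoring one."""
    candidates = [sample_fn() for _ in range(m)]
    return max(candidates, key=lambda c: score(key, c, n))


# Hypothetical usage: a toy sampler stands in for an LLM API call.
rng = random.Random(0)
toy_sampler = lambda: [rng.randrange(50) for _ in range(20)]
text = watermark_sample(toy_sampler, key=b"secret-key", m=8, n=2)
```

Only `sample_fn` touches the model, so the same code would work with any API that returns token sequences. A detector holding the key can recompute `score` on a text without access to the model at all.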

Paper Structure (25 sections, 6 theorems, 21 equations, 7 figures, 7 tables)


Introduction
Related Work
Algorithm
Theory
Experiments
Models, Datasets, and Hyperparameters
Evaluation Metrics
Adversarial Attacks
Baselines
Experimental Results
Overall performance of our flat and recursive schemes
Effects of Hyperparameters
Observations on detection
Conclusion
Appendix
...and 10 more sections

Key Result

Theorem 4.1

Let $X$ be any finite sequence and $P$ any prompt. Let $X_u \sim \texttt{LM}\left(\;\cdot\;|\; P\right)$ be the non-watermarked output of the conditional autoregressive language model. Let $X_w$ be the output of the watermarking procedure (Watermark in Algorithm \ref{algo:flat}, for both recursive and non

Figures (7)

Figure 1: Performance of our flat scheme. Top: Detection AUC and pAUC with 1% max FPR for a range of target text lengths when there is no corruption. Bottom Left: AUC (mixed $T$'s) as a function of the average non-watermarked response entropy of the examples used in the calculation. An $x$-coordinate of $x$ corresponds to the bucket of examples whose entropy lies in $[x-0.25, x]$ nats. Bottom Right: Effect of the amount of random token corruption on AUC (mixed $T$'s).
Figure 2: Effect of the amount of (random token replacement) corruption on detection pAUC (flat scheme; mixed $T$'s) with 1% max FPR.
Figure 3: Left: Histogram of the average entropy (nats) in the LLM's underlying next-token distribution across non-watermarked response tokens. Right: A lower bound for ROC-AUC predicted by Theorem \ref{thm:rocauc_unif} as a function of the entropy term $\alpha$ for the range of values we observe empirically. When $m$ is large, $\alpha$ becomes a reasonable estimator of the LLM's entropy.
Figure 4: Left: A lower bound for ROC-AUC predicted by Theorem \ref{thm:rocauc_unif} as a function of $m$ (using optimal $\alpha = \log(m)$). Right: Same plot, but as a function of $T$ (again, using optimal $\alpha$).
Figure 5: Given a distribution over the vocabulary (taken to be of size 32k), we can estimate $\alpha$ for finite $m$ via simulation (1000 trials). We observe that when the underlying next-token distribution is uniform, $\alpha \approx \log(m)$ in a practical range for $m$. However, when the underlying distribution is Zipf (lower entropy), $\alpha$ quickly deviates from $\log(m)$ as $m$ grows and the probability of sampling duplicate tokens increases.
...and 2 more figures

Theorems & Definitions (14)

Theorem 4.1: Distortion-free property
Theorem 4.2: Lower bound on detection ROC-AUC
Theorem 4.3: False positive rate
Theorem 4.4: Optimal detection for Gamma
Remark
Lemma A.1
proof
Remark
Proof of Theorem \ref{thm:distortion}
Proof of Theorem \ref{thm:fpr_general}
...and 4 more
