Publicly-Detectable Watermarking for Language Models

Jaiden Fairoze; Sanjam Garg; Somesh Jha; Saeed Mahloujifar; Mohammad Mahmoody; Mingyuan Wang

TL;DR

The paper addresses the challenge of publicly verifiable provenance for AI-generated text by proposing a publicly-detectable watermarking scheme with cryptographic guarantees. It embeds a message-signature pair into LM output using rejection sampling, signing, and error-correcting codes to tolerate periods of low entropy, while enabling detection via a public key without access to model weights. The authors formalize completeness, soundness, robustness, and distortion-freeness, prove security in the random oracle model, and provide an empirical evaluation of distortion-freeness and runtime. The work targets content authentication for long-form generation and enables outsourcing of watermark detection, with stated limitations on embedding density and robustness that motivate future research.

Abstract

We present a publicly-detectable watermarking scheme for LMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LM output using rejection sampling and prove that this produces unforgeable and distortion-free (i.e., undetectable without access to the public key) text output. We make use of error-correction to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims are met in practice.
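The rejection-sampling step mentioned in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `VOCAB` and `sample_block` stand in for a real language model, `block_bit` stands in for the scheme's block hash, and the codeword is an arbitrary bitstring rather than an error-corrected signature. The sketch only shows the mechanic of resampling a block of tokens until its hash equals the target bit, and of reading the bits back with no secret.

```python
import hashlib
import random

# Toy vocabulary standing in for a language model's token set (assumption).
VOCAB = ["the", "a", "model", "text", "signs", "reads", "token", "output"]


def block_bit(block):
    """Hypothetical block hash: lowest bit of the first SHA-256 byte."""
    digest = hashlib.sha256(" ".join(block).encode("utf-8")).digest()
    return digest[0] & 1


def sample_block(rng, block_len):
    """Stand-in for drawing block_len tokens from a language model."""
    return [rng.choice(VOCAB) for _ in range(block_len)]


def embed_bits(bits, block_len, rng, max_tries=1000):
    """Embed one bit per block by resampling until the block hash matches."""
    blocks = []
    for bit in bits:
        for _ in range(max_tries):
            candidate = sample_block(rng, block_len)
            if block_bit(candidate) == bit:
                blocks.append(candidate)
                break
        else:
            # Reached only if no candidate matched within max_tries.
            raise RuntimeError("no block hashed to the target bit")
    return blocks


def extract_bits(blocks):
    """Recover the embedded bits using only the public hash."""
    return [block_bit(block) for block in blocks]


rng = random.Random(0)
codeword = [1, 0, 1, 1, 0, 0, 1, 0]  # placeholder for a signature codeword
blocks = embed_bits(codeword, block_len=4, rng=rng)
recovered = extract_bits(blocks)
print(recovered == codeword)
```

In this toy setting roughly half of all candidate blocks hash to each bit, so a match is usually found within a few draws. With a real model, low-entropy stretches can make matching candidates rare, which is the situation the abstract says error correction is meant to handle.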

Paper Structure (38 sections, 6 theorems, 14 equations, 9 figures, 9 tables, 6 algorithms)

Introduction
Security Model
Preliminaries
Assumptions
Entity Interaction
Model provider
User
Definitions
On the unforgeability of our scheme
The relationship between $\delta_s$, $\delta_c$, and $\delta_r$
Protocol
Technical Overview
Dealing with low entropy sequences
Private Generation Algorithm
Public Detection Algorithm
...and 23 more sections

Key Result

Theorem 3.1

The scheme $\mathcal{PDWS}$ defined in alg:generator, alg:watermark, and alg:detector is an $(\ell, \ell + \ell\cdot\lambda_c, 2(\ell + \ell \cdot \lambda_c), \exp(-\Omega(\alpha)))$-publicly-detectable watermark.

Figures (9)

Figure 1: Our core gadget. Embedding is a three-step process as designated by (1) through (3). First (1), $\ell_m$ tokens are sampled natively from the LM. These tokens $\boldsymbol t$ are hashed twice with two different hash functions, producing $h_1 \gets H_1(\boldsymbol t)$ and $h_2 \gets H_2(\boldsymbol t)$. Second (2), $h_1$ is signed with the secret key $\mathsf{sk}$, error-corrected, and randomized with $h_2$. The final product is a pseudorandom bitstring $\boldsymbol{c} \gets h_2 \oplus \mathsf{Encode}_{\gamma}(\mathsf{Sign}_\mathsf{sk}(h_1))$. Lastly (3), each bit $c_i$ of the randomized codeword is embedded into the next $\ell_c$ tokens by rejection sampling. That is, the $i$-th block of $\ell_c$ tokens is sampled such that the hash of the block yields the $i$-th bit of the randomized codeword, i.e., $\forall i \in [\lambda_c], H_3\left(s_{i,1}^{c_{i,1}} \cdots s_{i,\ell_c}^{c_{i,\ell_c}}\right) = c_i$ where each $s$ is one token.
Figure 2: Aggregated text quality score assignments from GPT-4 Turbo for each generation algorithm configuration over the Mistral 7B model [jiang2023mistral]. For $\mathsf{asymmetric}$, the configurations from left to right represent the most compact (lowest quality) to least compact (highest quality) parameters. Each bar is the aggregation of GPT-4 Turbo-assigned quality scores for 250 distinct prompt completions. The error bars show the 95% interval data spread. Observe that no protocol clearly outperforms the others: the mean score falls between 27 and 40 for all protocols, and each one exhibits large quality spreads. Note that even the baseline decoder, $\mathsf{plain}$, follows this pattern. This suggests the watermarking protocols are indeed distortion-free.
Figure 3: Generation and detection runtimes for each generation algorithm over the Mistral 7B model [jiang2023mistral]. Five distinct generations were aggregated for each of the 10 random prompts from the news-like portion of the C4 dataset [raffel2020exploring]. The error bars show the 95% interval data spread. On average, the fastest to slowest generation runtimes were for: $\mathsf{plain}$ (expected as this is the baseline), $\mathsf{asymmetric}$, then $\mathsf{symmetric}$ and $\mathsf{plain\ with\ bits}$ (the latter two are about equal with the dominant cost being the reduction to a binary vocabulary). For detection, $\mathsf{asymmetric}$ runs much faster than $\mathsf{symmetric}$, which is expected given they run in linear vs. quadratic time, respectively, in the number of tokens $n$.
Figure 4: Generation runtimes for different parameter instantiations of our protocol over the OPT-2.7B model [zhang2022opt]. Five completions were generated for each of the 10 random prompts from the news-like portion of the C4 dataset [raffel2020exploring]. $\ell$ denotes the signature segment length, $\beta$ denotes the bit size, and $\gamma$ denotes the maximum number of planted errors. The error bars show the 95% interval data spread. Comparing non-error-corrected ($\gamma = 0$) vs. error-corrected ($\gamma = 2$) runtimes for prompts with high variance (prompts 4, 5, 6, 8, and 9 for parameters $\ell = 32,\ \beta = 2$ and $\ell = 16,\ \beta = 2$), we can see a clear reduction in the variance and mean runtime when error correction is applied to overcome low entropy periods. Note that we expect to sample $E_{\lambda}(\ell, \beta) = 2^\beta \cdot \frac{\lambda}{\beta} \cdot \ell$ characters to embed the signature codeword. Thus, in expectation, $E_\lambda(16, 1) = E_\lambda(16, 2) < E_\lambda(32, 2)$, where $\lambda = 328$ or $360$ depending on whether $\gamma = 0$ or $2$, respectively. Our empirical runtimes align with this.
Figure 5: Tiling structure to compress multiple message-signature pairs. This is possible because the signature codeword itself is pseudorandom.
...and 4 more figures
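The expected-sample formula quoted in the Figure 4 caption, $E_{\lambda}(\ell, \beta) = 2^\beta \cdot \frac{\lambda}{\beta} \cdot \ell$, can be checked numerically. The following is a small sketch; the function name `expected_samples` is a label chosen here, and the parameter values are the ones given in the caption.

```python
def expected_samples(ell, beta, lam):
    """Evaluate E_lambda(ell, beta) = 2**beta * (lam / beta) * ell."""
    return (2 ** beta) * (lam / beta) * ell


# lam = 328 corresponds to gamma = 0 and lam = 360 to gamma = 2 in the caption.
for lam in (328, 360):
    e_16_1 = expected_samples(16, 1, lam)
    e_16_2 = expected_samples(16, 2, lam)
    e_32_2 = expected_samples(32, 2, lam)
    print(lam, e_16_1, e_16_2, e_32_2)
```

For $\lambda = 328$ this gives 10496 for both $(\ell, \beta) = (16, 1)$ and $(16, 2)$, and 20992 for $(32, 2)$, matching the ordering stated in the caption. Doubling $\beta$ doubles the per-chunk rejection cost $2^\beta$ but halves the number of chunks $\lambda / \beta$, so the two effects cancel when going from $\beta = 1$ to $\beta = 2$.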

Theorems & Definitions (22)

Definition 2.1: Auto-regressive Model
Definition 2.2: Public-Key Signature Scheme
Definition 2.3: Unforgeability
Definition 2.4: Hamming Distance
Definition 2.5: Error-Correcting Code
Definition 2.7: Publicly-Detectable Watermarking Scheme
Definition 2.8: Completeness
Definition 2.9: Soundness/Unforgeability
Definition 2.10: Robustness
Definition 2.11: Distortion-freeness
...and 12 more
