Watermarking Language Models for Many Adaptive Users

Aloni Cohen, Alexander Hoover, Gabe Schoenbach

TL;DR

The paper addresses robust watermarking of language-model outputs under adaptive prompting and collusion, aiming to both detect AI-generated text and trace it to specific users. It introduces a unified AEB-robustness framework and a black-box reduction from zero-bit to $L$-bit watermarking, enabling lossless or lossy embedding with provable guarantees. A key contribution is a multi-user watermarking construction that remains undetectable while supporting collusion-resilient tracing via robust fingerprinting codes, with efficiency guarantees (detect in $O(\log n)$ and trace in $O(n \log n)$) and short keys. The work also shows how to preserve the intrinsic robustness of base zero-bit schemes, discusses extensions without undetectability, and outlines practical considerations like key duplication to defend against adaptive adversaries. Overall, it advances provable, scalable watermarking for safeguarding provenance and accountability in interactive AI systems.

Abstract

We study watermarking schemes for language models with provable guarantees. As we show, prior works offer no robustness guarantees against adaptive prompting: when a user queries a language model more than once, as even benign users do. And with just a single exception (Christ and Gunn, 2024), prior works are restricted to zero-bit watermarking: machine-generated text can be detected as such, but no additional information can be extracted from the watermark. Unfortunately, merely detecting AI-generated text may not prevent future abuses. We introduce multi-user watermarks, which allow tracing model-generated text to individual users or to groups of colluding users, even in the face of adaptive prompting. We construct multi-user watermarking schemes from undetectable, adaptively robust, zero-bit watermarking schemes (and prove that the undetectable zero-bit scheme of Christ, Gunn, and Zamir (2024) is adaptively robust). Importantly, our scheme provides both zero-bit and multi-user assurances at the same time. It detects shorter snippets just as well as the original scheme, and traces longer excerpts to individuals. The main technical component is a construction of message-embedding watermarks from zero-bit watermarks. Ours is the first generic reduction between watermarking schemes for language models. A challenge for such reductions is the lack of a unified abstraction for robustness -- that marked text is detectable even after edits. We introduce a new unifying abstraction called AEB-robustness. AEB-robustness provides that the watermark is detectable whenever the edited text "approximates enough blocks" of model-generated output.

Paper Structure

This paper contains 59 sections, 17 theorems, 51 equations, and 4 figures.

Key Result

Lemma 2.4

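For $\lambda, L \in \mathbb{N}$ and $0 \le \delta < 1$, let $k^*(L,\delta)$ denote the threshold defined in the paper (the defining equation is not reproduced here). Then, after throwing $k \ge k^*(L,\delta)$ balls into $L$ bins, fewer than $\delta L$ bins are empty except with probability at most $e^{-\lambda}$.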
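
The lemma is a balls-into-bins statement, so it can be checked empirically. The sketch below is a minimal simulation, not the paper's bound: because the definition of $k^*(L,\delta)$ is not shown here, it substitutes the standard coupon-collector count $\lceil L(\ln L + \lambda) \rceil$ as a stand-in assumption. By a union bound, that many balls leave no bin empty except with probability at most $e^{-\lambda}$, which implies the "fewer than $\delta L$ empty bins" event for any $\delta > 0$. The function names are hypothetical.

```python
import math
import random


def empty_bins_after(k: int, num_bins: int, rng: random.Random) -> int:
    """Throw k balls uniformly at random into num_bins bins; count empty bins."""
    hit = [False] * num_bins
    for _ in range(k):
        hit[rng.randrange(num_bins)] = True
    return hit.count(False)


def coupon_collector_balls(num_bins: int, lam: float) -> int:
    """Stand-in for k* (an assumption): ceil(L * (ln L + lambda)).

    Union bound: P[some bin empty] <= L * (1 - 1/L)^k <= L * exp(-k/L) <= exp(-lambda).
    """
    return math.ceil(num_bins * (math.log(num_bins) + lam))


def failure_rate(num_bins: int, delta: float, lam: float,
                 trials: int, seed: int = 0) -> float:
    """Fraction of trials in which at least delta * L bins stay empty."""
    rng = random.Random(seed)
    k = coupon_collector_balls(num_bins, lam)
    failures = sum(
        empty_bins_after(k, num_bins, rng) >= delta * num_bins
        for _ in range(trials)
    )
    return failures / trials
```

For example, with $L = 64$ and $\lambda = 3$ the stand-in count is 459 balls, and `failure_rate(64, 0.1, 3.0, 200)` should stay below $e^{-3}$. In the watermarking setting, bins presumably correspond to message indices and balls to generated blocks, so the lemma bounds how much text is needed before most indices have been embedded.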

Figures (4)

  • Figure 1: Visualization of a string $\hat{T}$ containing three approximate blocks from original generations $T_1$ and $T_2$. This $\hat{T}$ would satisfy the $R_3(\lambda, (Q_i)_i, (T_i)_i, \hat{T})$ robustness condition.
  • Figure 2: Pseudocode for the $L$-bit watermarking scheme $\mathcal{W}=(\mathsf{KeyGen}, \mathsf{Wat},\mathsf{Extract})$ built from a block-by-block zero-bit watermarking scheme $\mathcal{W}'=(\mathsf{KeyGen}',\mathsf{Wat}',\mathsf{Detect}')$.
  • Figure 3: Intermediate versions of $\mathsf{Wat}$ and $\mathsf{Extract}$, used in the proof of Lemma \ref{lem-msg-robust} to produce outputs that are independent of the keys. The boxed lines are used only in the indicated hybrid. At setup, we additionally initialize $\mathcal{Q}_{i,b}, \mathcal{G}_{i,b}$ to $(~)$ for all $i,b$.
  • Figure 4: Pseudocode for the construction of $\mathcal{W}=(\mathsf{KeyGen},\mathsf{Wat},\mathsf{Detect},\mathsf{Trace})$ from a fingerprinting code $\mathsf{FP}=(\mathsf{FP.Gen}',\mathsf{FP.Trace}')$ and an $L$-bit message embedding scheme $\mathcal{W}' = (\mathsf{KeyGen}', \mathsf{Wat}', \mathsf{Extract}')$, e.g. Figure \ref{fig-embedding}. The construction is defined for any public parameters $\mathsf{pp} = (n, c, \delta)$, where $n, c > 1$ and $0 \le \delta < 1$.
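
The Figure 2 caption describes building an $L$-bit scheme from a block-by-block zero-bit scheme. The pseudocode itself is not reproduced here, so the following is only one plausible reading of such a reduction, sketched under stated assumptions: keep $2L$ independent zero-bit keys $k_{i,b}$, watermark each generated block under $k_{i,m_i}$ for a randomly chosen index $i$, and extract bit $i$ by running the zero-bit detector under both $k_{i,0}$ and $k_{i,1}$. The zero-bit scheme (`toy_wat`, `toy_detect`) is a hypothetical keyed "green token" stand-in that is neither undetectable nor robust to edits, and `VOCAB`, `BLOCK_LEN`, and all function names are illustrative.

```python
from __future__ import annotations

import hashlib
import hmac
import random
import secrets

VOCAB = [f"w{i}" for i in range(256)]  # hypothetical toy vocabulary
BLOCK_LEN = 64                          # tokens per block (an assumption)


def green_set(key: bytes) -> set[str]:
    """Tokens whose keyed hash has an even first byte (roughly half of VOCAB)."""
    return {t for t in VOCAB
            if hmac.new(key, t.encode(), hashlib.sha256).digest()[0] % 2 == 0}


def toy_wat(key: bytes, rng: random.Random) -> list[str]:
    """Toy zero-bit Wat': emit one block drawn only from the key's green tokens."""
    green = sorted(green_set(key))
    return [rng.choice(green) for _ in range(BLOCK_LEN)]


def toy_detect(key: bytes, tokens: list[str]) -> bool:
    """Toy Detect': True if some run of BLOCK_LEN consecutive tokens is all green."""
    green = green_set(key)
    run = 0
    for t in tokens:
        run = run + 1 if t in green else 0
        if run >= BLOCK_LEN:
            return True
    return False


def keygen(num_bits: int) -> list[tuple[bytes, bytes]]:
    """Sample 2L independent zero-bit keys: one pair (k_{i,0}, k_{i,1}) per bit."""
    return [(secrets.token_bytes(16), secrets.token_bytes(16))
            for _ in range(num_bits)]


def wat(keys: list[tuple[bytes, bytes]], message: list[int],
        num_blocks: int, rng: random.Random) -> list[str]:
    """For each block, pick a random index i and watermark it under k_{i, m_i}."""
    out: list[str] = []
    for _ in range(num_blocks):
        i = rng.randrange(len(message))
        out.extend(toy_wat(keys[i][message[i]], rng))
    return out


def extract(keys: list[tuple[bytes, bytes]],
            tokens: list[str]) -> list[int | None]:
    """Run Detect' under all 2L keys; recover bit i only if exactly one key fires."""
    result: list[int | None] = []
    for k0, k1 in keys:
        d0, d1 = toy_detect(k0, tokens), toy_detect(k1, tokens)
        result.append(0 if d0 and not d1 else 1 if d1 and not d0 else None)
    return result
```

In this sketch, `None` plays the role of an erased bit: an index that no block happened to embed cannot be recovered. Lemma 2.4 is the kind of balls-into-bins bound that controls how many blocks are needed before few indices remain erased.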

Theorems & Definitions (79)

  • Definition 2.1: Fingerprinting codes -- syntax (IEEE:BS98)
  • Definition 2.2: Feasible sets
  • Definition 2.3: Fingerprinting codes -- robust security (ACM:BKM10)
  • Lemma 2.4
  • Definition 3.1: Zero-bit watermarking -- Syntax
  • Definition 3.2: Watermarking syntax -- $L$-bit
  • Definition 3.3: Soundness -- $L$-bit
  • Definition 3.4: Undetectability -- $L$-bit
  • Definition 3.5: Robustness condition
  • Definition 3.6: $(\delta,R)$-Robust extraction -- $L$-bit, adaptive
  • ...and 69 more