Unforgeable Watermarks for Language Models via Robust Signatures

Huijia Lin, Kameron Shahabi, Min Jae Song

TL;DR

This work constructs the first undetectable watermarking scheme that is robust, unforgeable, and recoverable with respect to substitutions, and shows that any standard digital signature scheme can be boosted to a robust one using property-preserving hash functions.

Abstract

Language models now routinely produce text that is difficult to distinguish from human writing, raising the need for robust tools to verify content provenance. Watermarking has emerged as a promising countermeasure, with existing work largely focused on model quality preservation and robust detection. However, current schemes provide limited protection against false attribution. We strengthen the notion of soundness by introducing two novel guarantees: unforgeability and recoverability. Unforgeability prevents adversaries from crafting false positives, texts that are far from any output from the watermarked model but are nonetheless flagged as watermarked. Recoverability provides an additional layer of protection: whenever a watermark is detected, the detector identifies the source text from which the flagged content was derived. Together, these properties strengthen content ownership by linking content exclusively to its generating model, enabling secure attribution and fine-grained traceability. We construct the first undetectable watermarking scheme that is robust, unforgeable, and recoverable with respect to substitutions (i.e., perturbations in Hamming metric). The key technical ingredient is a new cryptographic primitive called robust (or recoverable) digital signatures, which allow verification of messages that are close to signed ones, while preventing forgery of messages that are far from all previously signed messages. We show that any standard digital signature scheme can be boosted to a robust one using property-preserving hash functions (Boyle, LaVigne, and Vaikuntanathan, ITCS 2019).
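The interface of a robust, recoverable signature described above can be illustrated with a deliberately naive sketch. It is not the paper's construction: it uses an HMAC (a secret-key MAC from Python's standard library) as a stand-in for a digital signature, and it simply embeds the signed message in the signature instead of compressing it with a property-preserving hash. The names `sign`, `verify_and_recover`, and `max_subs` are hypothetical and chosen only for this illustration.

```python
import hashlib
import hmac
from typing import Optional, Tuple

Signature = Tuple[bytes, bytes]  # (signed message, MAC tag); hypothetical layout


def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of positions at which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length inputs")
    return sum(x != y for x, y in zip(a, b))


def sign(key: bytes, message: bytes) -> Signature:
    """Naive 'robust signature': store the message next to an ordinary tag."""
    tag = hmac.new(key, message, hashlib.sha256).digest()
    return (message, tag)


def verify_and_recover(
    key: bytes, candidate: bytes, signature: Signature, max_subs: int
) -> Optional[bytes]:
    """Accept `candidate` if it is within `max_subs` substitutions of the
    signed message, and return that signed message (recoverability).
    Return None if the tag is invalid or the candidate is too far."""
    signed_message, tag = signature
    expected = hmac.new(key, signed_message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # the embedded message was never signed under this key
    if len(candidate) != len(signed_message):
        return None  # only substitutions are modelled, not insertions/deletions
    if hamming_distance(candidate, signed_message) > max_subs:
        return None  # far from the signed message: treated as a forgery attempt
    return signed_message


key = b"k" * 32
sig = sign(key, b"hello world")
print(verify_and_recover(key, b"hellO world", sig, max_subs=2))
print(verify_and_recover(key, b"HELLO WORLD", sig, max_subs=2))
```

In this toy version the signature is as long as the message, which is exactly the cost a compact construction would want to avoid; the abstract attributes that step to property-preserving hash functions, which this sketch does not attempt to reproduce.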

Paper Structure (78 sections, 25 theorems, 109 equations, 7 figures, 13 algorithms)

Key Result

Theorem 1.5

Let $Q$ be any language model, $\Phi$ be any closeness predicate, and $n : \mathbb{N} \rightarrow \mathbb{N}$ be any polynomial block size. Assume there exists both [...] Then there exists a watermarking scheme for $Q$ that is (secret or public key) robust and unforgeable (against adversaries outputting sufficiently long strings $\zeta$) with respect to the closeness predicate $\mathsf{EBC}[\Phi, n]$. If [...]

Figures (7)

  • Figure 1: Illustration of watermarking guarantees. Correctness and soundness (blue and purple) are statistical guarantees, whereas robustness and unforgeability (green and red) can be computational. Consequently, the green and red regions may contain unwatermarked and watermarked content respectively, but no efficient adversary can find them.
  • Figure 2: Illustration of $\mathsf{EBC}[\Phi, n]$-closeness. Green dotted lines indicate $\Phi$-closeness. The string $\zeta^*$ is $\mathsf{EBC}[\Phi, n]$-close to $y$ because its blocks are $\Phi$-close to three consecutive blocks of $y$, whereas $\zeta'$ may be far: although $\zeta'$ is close to a substring of $y$ of length $2n$, this substring is not aligned with two blocks of $y$.
  • Figure 3: Generic PPH recovery algorithm
  • Figure 4: ${\mathsf{RGen}}$
  • Figure 5: Pseudocode for watermarking scheme $({\mathsf{Gen}}, {\mathsf{Wat}}, \mathsf{Ver})$.
  • ...and 2 more figures
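The block-alignment requirement in the Figure 2 caption can be made concrete with a small sketch. It follows only the informal description in the caption, so the formal Definition 3.2 may differ in details such as how trailing partial blocks are handled. Here $\Phi$ is instantiated, as an assumption, by a Hamming-closeness predicate with a hypothetical fractional threshold `delta`, and the function names are illustrative.

```python
from typing import Callable, List

Predicate = Callable[[str, str], bool]


def hamming_close(delta: float) -> Predicate:
    """Phi: two equal-length blocks are close if they differ in at most
    a delta fraction of positions (an assumed instantiation of Phi)."""
    def phi(a: str, b: str) -> bool:
        if len(a) != len(b):
            return False
        return sum(x != y for x, y in zip(a, b)) <= delta * len(a)
    return phi


def split_blocks(s: str, n: int) -> List[str]:
    """Cut s into consecutive length-n blocks, dropping a trailing partial block."""
    return [s[i:i + n] for i in range(0, len(s) - n + 1, n)]


def every_block_close(zeta: str, y: str, phi: Predicate, n: int) -> bool:
    """True if the blocks of zeta are Phi-close, one by one, to some run of
    consecutive blocks of y that starts on a block boundary of y."""
    zeta_blocks = split_blocks(zeta, n)
    y_blocks = split_blocks(y, n)
    if not zeta_blocks:
        return False
    for start in range(len(y_blocks) - len(zeta_blocks) + 1):
        if all(phi(zb, y_blocks[start + j]) for j, zb in enumerate(zeta_blocks)):
            return True
    return False


y = "aaaabbbbccccdddd"          # four blocks of size n = 4
phi = hamming_close(0.25)       # at most one substitution per block

# Aligned with blocks 2 and 3 of y, with one substituted character.
print(every_block_close("bbbxcccc", y, phi, 4))
# An exact substring of y of length 2n, but shifted off the block boundaries.
print(every_block_close("aabbbbcc", y, phi, 4))
```

The second call shows the point made in the caption: a string can match a length-$2n$ substring of $y$ exactly and still fail the check, because that substring straddles block boundaries.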

Theorems & Definitions (81)

  • Definition 1.1: Robust and unforgeable watermarking, informal version of Definitions \ref{def:robustness-general} and \ref{def:unforgeability-general}
  • Definition 1.2: Robust recovery, informal version of Definition \ref{def:recoverability-general}
  • Definition 1.3: Unforgeable recovery, informal version of Definition \ref{def:unforgeable-recoverability-general}
  • Definition 1.4: Robust/Recoverable signatures, informal version of Definitions \ref{def:rds-formal} and \ref{def:rds-recoverability-formal}
  • Theorem 1.5: General watermarking framework, informal version of Theorem \ref{thm:watermark-main}
  • Remark 1.6: Subtlety between robustness and unforgeability
  • Definition 1.7: Difference recovery, informal version of Definition \ref{def:PPH-recoverability}
  • Theorem 1.8: General robust signature framework, informal statement of Theorem \ref{thm:rds-appdx-main}
  • Definition 3.1: Hamming
  • Definition 3.2: Every-block close
  • ...and 71 more