Undetectable Watermarks for Language Models

Miranda Christ, Sam Gunn, Or Zamir

TL;DR

The paper asks whether watermarks can be embedded in language model outputs so that watermarked text is cryptographically indistinguishable from unwatermarked text, even under adaptive prompting. It gives rigorous definitions of undetectability, completeness, and substring-completeness, based on empirical entropy and PRF-based seeds, and proves that its constructions are undetectable, sound, and complete under cryptographic assumptions, where completeness requires the output's empirical entropy to exceed a bound that is sublinear in its length. The authors also discuss why an entropy condition is necessary, provide a simplified random-oracle-based construction, and present a full practical scheme, along with attacks and limits on removability. The work lays a principled foundation for covertly marking model outputs without sacrificing quality, and it analyzes the practical and theoretical trade-offs, including potential vulnerabilities and open problems in watermark robustness.

Abstract

Recent advances in the capabilities of large language models such as GPT-4 have spurred increasing concern about our ability to detect AI-generated text. Prior works have suggested methods of embedding watermarks in model outputs, by noticeably altering the output distribution. We ask: Is it possible to introduce a watermark without incurring any detectable change to the output distribution? To this end we introduce a cryptographically-inspired notion of undetectable watermarks for language models. That is, watermarks can be detected only with the knowledge of a secret key; without the secret key, it is computationally intractable to distinguish watermarked outputs from those of the original model. In particular, it is impossible for a user to observe any degradation in the quality of the text. Crucially, watermarks should remain undetectable even when the user is allowed to adaptively query the model with arbitrarily chosen prompts. We construct undetectable watermarks based on the existence of one-way functions, a standard assumption in cryptography.

Paper Structure

This paper contains 39 sections, 15 theorems, 41 equations, and 6 algorithms.

Key Result

Theorem 1

For any model $\mathsf{Model}$ we construct a watermarking scheme $\mathcal{W}$ that is undetectable, sound, and $O(\lambda \sqrt{L})$-complete.
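
To make the shape of such a scheme concrete, here is a minimal Python sketch of a keyed watermark over a binary alphabet, with a detector that compares a score against a threshold of the form $L + \lambda\sqrt{L}$. It is an illustration under stated assumptions, not the paper's construction: `toy_model`, `prf_unit`, `generate`, `score`, and `detect` are hypothetical names, HMAC-SHA256 stands in for the PRF, and the PRF is keyed on the token index alone. Because of that last choice, repeated queries would produce correlated outputs, so this toy version does not achieve the undetectability that Theorem 1 states; it only shows how a secret key can bias sampling while keeping each token's marginal distribution equal to the model's, and why detection needs enough entropy in the output.

```python
import hashlib
import hmac
import math


def prf_unit(key: bytes, index: int) -> float:
    """Keyed pseudorandom value strictly inside (0, 1), from HMAC-SHA256.

    Assumption: HMAC-SHA256 is used here as a stand-in PRF.
    """
    digest = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (n + 0.5) / (1 << 52)


def toy_model(prefix: list) -> float:
    """Hypothetical stand-in model: probability that the next bit is 1."""
    return 0.3 if prefix and prefix[-1] == 1 else 0.6


def generate(key: bytes, length: int) -> list:
    """Sample bits using keyed randomness instead of fresh randomness."""
    bits = []
    for i in range(length):
        p = toy_model(bits)
        u = prf_unit(key, i)
        # u is (pseudo)uniform, so P(bit = 1) = p, as in the model.
        bits.append(1 if u <= p else 0)
    return bits


def score(key: bytes, bits: list) -> float:
    """Sum of ln(1/v_i), where v_i = u_i if bit is 1, else 1 - u_i."""
    total = 0.0
    for i, b in enumerate(bits):
        u = prf_unit(key, i)
        v = u if b == 1 else 1.0 - u
        total += -math.log(v)
    return total


def detect(key: bytes, bits: list, lam: float) -> bool:
    """Flag text whose score exceeds L + lam * sqrt(L)."""
    length = len(bits)
    return score(key, bits) > length + lam * math.sqrt(length)
```

For text produced independently of the key, each $v_i$ is uniform, so each term $\ln(1/v_i)$ has mean 1 and the score concentrates around $L$ with standard deviation $\sqrt{L}$. For text from `generate`, each term has mean 1 plus the entropy (in nats) of that token's distribution, so the score clears the threshold only when the output carries enough entropy relative to $\lambda\sqrt{L}$.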

Theorems & Definitions (36)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9
  • Theorem 1
  • ...and 26 more