Publicly-Detectable Watermarking for Language Models
Jaiden Fairoze, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang
TL;DR
The paper addresses publicly verifiable provenance for AI-generated text by proposing a publicly-detectable watermarking scheme with cryptographic guarantees. The scheme embeds a message-signature pair into LM output using rejection sampling, and uses error-correcting codes to tolerate periods of low entropy; detection requires only a public key, not access to model weights. The authors formalize completeness, soundness, robustness, and distortion-freeness, prove security in the random oracle model, and report an empirical evaluation covering distortion-freeness and practical runtime. The work targets content authentication for long-form generation and allows watermark detection to be outsourced; its limitations in embedding density and robustness are left as directions for future research.
Abstract
We present a publicly-detectable watermarking scheme for LMs: the detection algorithm contains no secret information, and it is executable by anyone. We embed a publicly-verifiable cryptographic signature into LM output using rejection sampling and prove that this produces unforgeable and distortion-free (i.e., undetectable without access to the public key) text output. We make use of error-correction to overcome periods of low entropy, a barrier for all prior watermarking schemes. We implement our scheme and find that our formal claims are met in practice.
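To make the rejection-sampling idea concrete, the following is a minimal Python sketch, not the authors' implementation. It shows only the bit-embedding layer: each payload bit is carried by a block of tokens, and blocks are resampled until a public hash of the text maps to the desired bit. Everything named here is an assumption for illustration: `VOCAB`, `BLOCK`, `toy_sample_block` (a uniform sampler standing in for a language model), and the use of SHA-256 in place of the random oracle. The payload stands in for the message-signature pair; signing and error correction are omitted because the Python standard library has no public-key signature primitive.

```python
import hashlib
import random

# Hypothetical toy vocabulary and block size; a real scheme samples from an LM.
VOCAB = ["the", "a", "model", "text", "signs", "output", "token", "public", "key", "bit"]
BLOCK = 4  # tokens used to carry one payload bit (illustrative parameter)


def hash_bit(prefix, block):
    """Public hash (stand-in for the random oracle): (context, block) -> one bit."""
    data = " ".join(prefix + block).encode()
    return hashlib.sha256(data).digest()[0] & 1


def toy_sample_block(rng):
    """Stand-in for sampling BLOCK consecutive tokens from a language model."""
    return [rng.choice(VOCAB) for _ in range(BLOCK)]


def to_bits(payload):
    """Expand bytes into a list of bits, most significant bit first."""
    return [(byte >> i) & 1 for byte in payload for i in range(7, -1, -1)]


def embed(payload, rng, max_tries=1000):
    """Embed payload bits by resampling each block until its hash bit matches."""
    tokens = []
    for bit in to_bits(payload):
        for _ in range(max_tries):
            block = toy_sample_block(rng)
            if hash_bit(tokens, block) == bit:
                tokens.extend(block)
                break
        else:
            # With too little entropy no block may ever hash to the target bit;
            # this is the failure case that error correction is meant to absorb.
            raise RuntimeError("entropy too low to embed this bit")
    return tokens


def extract(tokens):
    """Recover the embedded bits from the text alone; no secret is needed."""
    return [
        hash_bit(tokens[:i], tokens[i:i + BLOCK])
        for i in range(0, len(tokens), BLOCK)
    ]


rng = random.Random(0)
payload = b"sig!"  # placeholder for a real message-signature pair
tokens = embed(payload, rng)
recovered = extract(tokens)
```

In this sketch `extract` uses only the text and a public hash function, which mirrors the property that detection contains no secret information. In a full scheme the recovered bits would then be checked as a signature under the public key, a step this example leaves out.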
