A Watermark for Black-Box Language Models
Dara Bahri, John Wieting
TL;DR
This work presents a distortion-free black-box watermarking framework for language models that requires only sample access and supports recursive chaining via multiple secret keys. It encodes watermarks by scoring candidate sequences with a PRF derived from a secret key and deduplicated n-grams, enabling statistically testable detection using p-values and Fisher-style combinations across keys. Theoretical results establish distortion-free properties and ROC-AUC guarantees, while experiments on large LLMs (e.g., Mistral-7B-instruct, Gemma-7B-instruct) show strong detection performance that can rival white-box approaches, though robustness to paraphrasing and adversarial perturbations remains a challenge. Overall, the approach offers a practical, provably detectable watermarking mechanism for third-party users of LLM APIs with quantified performance and attack-resilience trade-offs.
Abstract
Watermarking has recently emerged as an effective strategy for detecting the outputs of large language models (LLMs). Most existing schemes require white-box access to the model's next-token probability distribution, which is typically not accessible to downstream users of an LLM API. In this work, we propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM (i.e., black-box access), boasts a distortion-free property, and can be chained or nested using multiple secret keys. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
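To make the sampling-only idea concrete, the following is a minimal Python sketch of a keyed n-gram scoring scheme in the spirit of the one summarized above. It is an illustration under stated assumptions, not the paper's algorithm: HMAC-SHA256 as the PRF, uniform scores on [0, 1), a normal approximation for the detection p-value, and all function names (`prf_uniform`, `unique_ngrams`, `pvalue`, `watermarked_sample`, `fisher_combine`) are hypothetical choices made here. It does not attempt to reproduce the distortion-free guarantee or the ROC-AUC bounds.

```python
import hashlib
import hmac
import math
import struct


def prf_uniform(key: bytes, ngram: tuple) -> float:
    """Map (key, n-gram) to a pseudorandom value in [0, 1).

    Assumption: HMAC-SHA256 stands in for the PRF.
    """
    msg = "\x1f".join(ngram).encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return struct.unpack(">Q", digest[:8])[0] / 2**64


def unique_ngrams(tokens, n):
    """Return the set of distinct n-grams, so repeats are scored only once."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pvalue(key, tokens, n=2):
    """One-sided p-value for 'the PRF scores are unusually high'.

    Assumption: under the null the k scores are i.i.d. uniform, so their sum
    has mean k/2 and variance k/12; a normal approximation is used here.
    """
    grams = unique_ngrams(tokens, n)
    k = len(grams)
    if k == 0:
        return 1.0
    total = sum(prf_uniform(key, g) for g in grams)
    z = (total - k / 2) / math.sqrt(k / 12)
    return 0.5 * math.erfc(z / math.sqrt(2))


def watermarked_sample(sample_fn, key, m=8, n=2):
    """Draw m candidates from a black-box sampler; keep the one that looks
    most watermarked under `key` (smallest p-value)."""
    candidates = [sample_fn() for _ in range(m)]
    return min(candidates, key=lambda toks: pvalue(key, toks, n))


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(ln p) follows a chi-squared law with 2k
    degrees of freedom under the null, whose survival function has a closed
    form for even degrees of freedom."""
    k = len(pvalues)
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # equals x / 2
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    return min(1.0, math.exp(-half) * series)
```

In this sketch, `sample_fn` is any zero-argument callable that returns a list of string tokens, for example a thin wrapper around an LLM API call. Chaining with several keys could be approximated by letting the sampler for one key be a `watermarked_sample` call under another key, then passing the per-key p-values to `fisher_combine` at detection time; whether that matches the paper's recursive construction is an assumption.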
