Towards Optimal Statistical Watermarking
Baihe Huang, Hanlin Zhu, Banghua Zhu, Kannan Ramchandran, Michael I. Jordan, Jason D. Lee, Jiantao Jiao
TL;DR
The paper reframes statistical watermarking for text generation as a hypothesis-testing problem with a random rejection region tied to a secret watermark key, allowing precise control of Type I and Type II errors. It proves that the Uniformly Most Powerful watermark can be achieved via pseudo-random approximations of the output distribution and clipping, and it derives minimax, model-agnostic guarantees with explicit rates. In the i.i.d. token setting, it establishes the scaling $n_{\mathrm{ump}}(h,\alpha,\beta)=\Theta\left(\frac{\ln(1/h)(\ln(1/\alpha)\wedge\ln(1/\beta))}{h}\right)$ and $n_{\mathrm{minmax}}(h,\alpha,\beta)=\Theta\left(\frac{\ln(1/h)}{h}(\ln(1/\alpha)+\ln(1/\beta))\right)$, which improves the dependence on the per-token entropy $h$ relative to the $h^{-2}$ rates of prior work. The authors extend the framework to robust watermarking via a perturbation graph and LP-based optimization, and validate the theory with experiments on benchmark data, showing practical detectability with fewer tokens. Overall, the work provides a unified, information-theoretic foundation for evaluating and designing watermarking schemes with near-optimal guarantees and robustness considerations.
Abstract
We study statistical watermarking by formulating it as a hypothesis testing problem, a general framework which subsumes all previous statistical watermarking methods. Key to our formulation is a coupling of the output tokens and the rejection region, realized by pseudo-random generators in practice, that allows non-trivial trade-offs between the Type I error and Type II error. We characterize the Uniformly Most Powerful (UMP) watermark in the general hypothesis testing setting and the minimax Type II error in the model-agnostic setting. In the common scenario where the output is a sequence of $n$ tokens, we establish nearly matching upper and lower bounds on the number of i.i.d. tokens required to guarantee small Type I and Type II errors. Our rate of $\Theta(h^{-1} \log (1/h))$ with respect to the average entropy per token $h$ highlights the potential for improvement over the rate of $h^{-2}$ in previous works. Moreover, we formulate the robust watermarking problem, where the user is allowed to perform a class of perturbations on the generated texts, and characterize the optimal Type II error of robust UMP tests via a linear programming problem. To the best of our knowledge, this is the first systematic statistical treatment of the watermarking problem with near-optimal rates in the i.i.d. setting, which might be of interest for future work.
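To give a feel for the gap between the $h^{-1}\log(1/h)$ and $h^{-2}$ dependences, the following is a minimal numerical sketch. It evaluates the rate expressions quoted in the TL;DR with every hidden constant set to 1, which is purely an assumption: the $\Theta(\cdot)$ notation leaves those constants unspecified, so the outputs indicate growth trends only and are not actual token counts. The function names are hypothetical and do not come from the paper.

```python
import math


def _check_unit_interval(**params):
    # The rate expressions are only meaningful for values strictly in (0, 1).
    for name, value in params.items():
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must lie in (0, 1), got {value}")


def entropy_rate_new(h):
    """h^{-1} * ln(1/h): entropy dependence from the abstract (constant assumed 1)."""
    _check_unit_interval(h=h)
    return math.log(1.0 / h) / h


def entropy_rate_prev(h):
    """h^{-2}: entropy dependence attributed to previous works (constant assumed 1)."""
    _check_unit_interval(h=h)
    return 1.0 / (h * h)


def n_ump_rate(h, alpha, beta):
    """ln(1/h) * min(ln(1/alpha), ln(1/beta)) / h, as quoted in the TL;DR."""
    _check_unit_interval(h=h, alpha=alpha, beta=beta)
    return math.log(1.0 / h) * min(math.log(1.0 / alpha), math.log(1.0 / beta)) / h


def n_minmax_rate(h, alpha, beta):
    """(ln(1/h) / h) * (ln(1/alpha) + ln(1/beta)), as quoted in the TL;DR."""
    _check_unit_interval(h=h, alpha=alpha, beta=beta)
    return math.log(1.0 / h) / h * (math.log(1.0 / alpha) + math.log(1.0 / beta))


if __name__ == "__main__":
    # Compare the two entropy dependences as the per-token entropy h shrinks.
    for h in (0.5, 0.1, 0.01):
        new, prev = entropy_rate_new(h), entropy_rate_prev(h)
        print(f"h={h}: h^-1 ln(1/h) = {new:.2f}, h^-2 = {prev:.2f}, ratio = {prev / new:.2f}")
```

Because $h\ln(1/h) \le 1/e$ on $(0,1)$, the first expression is smaller than $h^{-2}$ for every $h$ in that range, and the ratio between them grows as $h \to 0$. With constants normalized this way, the UMP expression also never exceeds the minimax one, since the minimum of two positive terms is at most their sum.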
