Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models

Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak

TL;DR

The paper proves a general impossibility result: strong watermarking of generative models cannot be achieved under mild, realistic assumptions, even for secret-key schemes. It introduces a universal attack framework that uses a quality oracle and a perturbation oracle to perform a quality-preserving random walk, erasing the watermark while keeping quality at or above that of the original output. The authors formalize the attack, give a rigorous probabilistic analysis based on the mixing time of the graph induced by the perturbation oracle, and validate the approach empirically on three LLM watermarking schemes, removing the watermark in each case with only minor quality degradation. The work urges caution about relying on strong watermarking for provenance and points toward weak watermarking or alternative provenance and cryptographic approaches, with implications for policy and safety in AI deployment.

Abstract

Watermarking generative models consists of planting a statistical signal (watermark) in a model's output so that it can be later verified that the output was generated by the given model. A strong watermarking scheme satisfies the property that a computationally bounded attacker cannot erase the watermark without causing significant quality degradation. In this paper, we study the (im)possibility of strong watermarking schemes. We prove that, under well-specified and natural assumptions, strong watermarking is impossible to achieve. This holds even in the private detection algorithm setting, where the watermark insertion and detection algorithms share a secret key, unknown to the attacker. To prove this result, we introduce a generic efficient watermark attack; the attacker is not required to know the private key of the scheme or even which scheme is used. Our attack is based on two assumptions: (1) The attacker has access to a "quality oracle" that can evaluate whether a candidate output is a high-quality response to a prompt, and (2) The attacker has access to a "perturbation oracle" which can modify an output with a nontrivial probability of maintaining quality, and which induces an efficiently mixing random walk on high-quality outputs. We argue that both assumptions can be satisfied in practice by an attacker with weaker computational capabilities than the watermarked model itself, to which the attacker has only black-box access. Furthermore, our assumptions will likely only be easier to satisfy over time as models grow in capabilities and modalities. We demonstrate the feasibility of our attack by instantiating it to attack three existing watermarking schemes for large language models: Kirchenbauer et al. (2023), Kuditipudi et al. (2023), and Zhao et al. (2023). The same attack successfully removes the watermarks planted by all three schemes, with only minor quality degradation.
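
The attack described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the vocabularies `GREEN` and `RED`, the `JUNK` token, and the functions `quality`, `perturb`, and `green_fraction` are hypothetical stand-ins for a real quality oracle, perturbation oracle, and watermark detector. The acceptance rule (keep a candidate only if its quality is at least that of the original output) is one reading of "quality-preserving".

```python
import random

# Hypothetical toy setup: "green" tokens stand in for watermark-bearing content.
GREEN = {"alpha", "beta", "gamma", "delta"}
RED = {"one", "two", "three", "four"}
JUNK = "###"  # a token that lowers quality


def quality(prompt, tokens):
    """Toy quality oracle: counts tokens that are not junk."""
    return sum(tok != JUNK for tok in tokens)


def perturb(prompt, tokens, rng):
    """Toy perturbation oracle: resample one position, possibly with junk."""
    proposal = list(tokens)
    i = rng.randrange(len(proposal))
    proposal[i] = rng.choice(sorted(GREEN | RED) + [JUNK])
    return proposal


def green_fraction(tokens):
    """Toy detector statistic: share of tokens drawn from GREEN."""
    return sum(tok in GREEN for tok in tokens) / len(tokens)


def random_walk_attack(prompt, output, steps, rng):
    """Random walk that rejects any candidate below the original quality."""
    threshold = quality(prompt, output)
    current = list(output)
    for _ in range(steps):
        candidate = perturb(prompt, current, rng)
        if quality(prompt, candidate) >= threshold:
            current = candidate  # accept: quality is preserved
    return current


rng = random.Random(0)
watermarked = ["alpha", "beta", "gamma", "delta"] * 5  # all-green toy output
attacked = random_walk_attack("toy prompt", watermarked, steps=500, rng=rng)
print(green_fraction(watermarked), green_fraction(attacked))
```

In this toy setting the walk never accepts a junk token, so quality cannot fall below the starting value, while the green fraction drifts away from 1.0 toward the share of green tokens in the vocabulary. The attacker never consults the detector or any key, which mirrors the black-box setting the abstract describes.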

Paper Structure

This paper contains 46 sections, 10 theorems, 16 equations, 19 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

For every (public or secret-key) watermarking setting satisfying the above assumptions, there is an efficient attacker that given a prompt $x$ and (watermarked) output $y$ with probability close to one, uses the quality and perturbation oracles to obtain an output $y'$ such that (1) $y'$ is not watermarked …

Figures (19)

  • Figure 1: An outline of our quality-preserving random walk attack schema (differences from the original watermarked text are highlighted). We consider the set of all possible outputs and, within it, the set of all high-quality outputs (with respect to the original prompt). For any quality-preserving watermarking scheme with a low false-positive rate, the set of watermarked outputs (green) will be a small subset of the high-quality outputs (orange). We then take a random walk on the set of high-quality outputs to arrive at a non-watermarked output (red) by generating candidate neighbors through the perturbation oracle and using the quality oracle to reject all low-quality candidates. We instantiate our attack for text as follows: given a watermarked text, at each iteration, the malicious user can generate span substitutions using a small masked LM, while making sure the response quality with respect to the user query does not decrease according to a quality oracle such as GPT-3.5 or a reward model.
  • Figure 2: Detection and quality w.r.t. the number of perturbation steps using Llama2-7B with the KGW scheme (Kirchenbauer et al., 2023). Left: z-score (number of standard deviations from the null hypothesis of non-watermarked content). Right: GPT-4 Judge score. Results are aggregated across 12 examples and the order of comparands.
  • Figure 3: (a) Detection performance before (Watermarked) and after (Attack) our attack using Llama2-7B with KGW (Kirchenbauer et al., 2023). (b) Comparative evaluation of watermarked texts against post-attack texts using GPT-4 as a judge. Scoring criteria: 1=post-attack response is much better than watermarked one, 2=slightly better, 3=of similar quality, 4=slightly worse, 5=much worse. Each example is included as two data points, one for each ordering of the two outputs in the GPT-4 query.
  • Figure 4: Detection performance w.r.t. the watermarked text length using Llama2-7B-Chat with KGW (Kirchenbauer et al., 2023). Results are aggregated across hundreds of examples.
  • Figure 5: Qualitative examples of the watermarked images after (left) and before (right) our attack for two watermarking schemes. Images are generated by prompting stable-diffusion-2-base with the prompt "A long and winding beach, tropical, bright, simple, by Studio Ghibli and Greg Rutkowski, artstation\n". Detection and quality evaluation results: Invisible Watermark (p-value $3.5 \times 10^{-15} \rightarrow 0.2354$, CLIP score $34.82 \rightarrow 33.60$, GPT-4 Judge $0$), Stable Signature (p-value $1.3 \times 10^{-5} \rightarrow 0.468$, CLIP score $32.27 \rightarrow 31.58$, GPT-4 Judge $0$).
  • ...and 14 more figures
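
The z-score reported in the KGW detection figures can be illustrated with a short sketch. It assumes the standard one-proportion z-test used by green-list schemes: under the null hypothesis of non-watermarked text, each of the $T$ scored tokens falls in the green list with probability $\gamma$. The function name `green_list_z_score` and the example counts are hypothetical, and $\gamma = 0.25$ is an assumed setting rather than a value taken from this page.

```python
import math


def green_list_z_score(green_count: int, total_tokens: int, gamma: float) -> float:
    """Standard deviations by which the green-token count exceeds its null mean.

    Under the null hypothesis (no watermark), the count of green tokens has
    mean gamma * T and variance T * gamma * (1 - gamma).
    """
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# Hypothetical counts for a 200-token text with an assumed gamma of 0.25.
z_watermarked = green_list_z_score(green_count=140, total_tokens=200, gamma=0.25)
z_after_attack = green_list_z_score(green_count=50, total_tokens=200, gamma=0.25)
print(round(z_watermarked, 2), round(z_after_attack, 2))  # 14.7 0.0
```

A text whose green-token count sits at the null mean scores zero, so an attack that resamples enough tokens without access to the key pushes the statistic back toward the non-watermarked range.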

Theorems & Definitions (22)

  • Theorem 1: Main result, informal
  • Definition 1: Generative models
  • Definition 2: Quality function
  • Definition 3: Secret-key watermarking scheme
  • Definition 4: False negative and false positive $\epsilon$-rates
  • Definition 5: Erasure attack against watermarking schemes
  • Definition 6: Perturbation oracle
  • Definition 7: Graph representation of perturbation oracles
  • Theorem 2
  • Corollary 1
  • ...and 12 more