Removal Attack and Defense on AI-generated Content Latent-based Watermarking

De Zhang Lee, Han Fang, Hanyi Wang, Ee-Chien Chang

TL;DR

This work reveals that latent-space watermarks embedded in LDMs can be vulnerable to removal attacks that exploit boundary leakage, even when the watermark is indistinguishable. It develops a stealthy removal strategy that achieves far smaller perturbations than whitenoise by leveraging leaked boundary information, and proposes a boundary-hiding defense based on a secret, norm-preserving transformation coupled with a well-behaved detector. The authors prove that, under appropriate conditions, the defense neutralizes attacker advantage, making any perturbation equivalent to whitenoise, and they validate the approach with extensive experiments on Stable Diffusion variants. The study emphasizes the importance of concealing boundary information in latent-based watermarking to ensure robustness against removal while maintaining image fidelity. Overall, the work provides both a practical defense and a rigorous security framing for latent-space watermarking in AIGC.

Abstract

Digital watermarks can be embedded into AI-generated content (AIGC) by initializing the generation process with starting points sampled from a secret distribution. When combined with pseudorandom error-correcting codes, such watermarked outputs can remain indistinguishable from unwatermarked objects, while maintaining robustness under whitenoise. In this paper, we go beyond indistinguishability and investigate security under removal attacks. We demonstrate that indistinguishability alone does not necessarily guarantee resistance to adversarial removal. Specifically, we propose a novel attack that exploits boundary information leaked by the locations of watermarked objects. This attack significantly reduces the distortion required to remove watermarks -- by up to a factor of $15\times$ compared to a baseline whitenoise attack under certain settings. To mitigate such attacks, we introduce a defense mechanism that applies a secret transformation to hide the boundary, and prove that the secret transformation effectively renders any attacker's perturbations equivalent to those of a naive whitenoise adversary. Our empirical evaluations, conducted on multiple versions of Stable Diffusion, validate the effectiveness of both the attack and the proposed defense, highlighting the importance of addressing boundary leakage in latent-based watermarking schemes.

Paper Structure

This paper contains 54 sections, 3 theorems, 11 equations, 10 figures, 4 tables.

Key Result

Lemma 1

Suppose $T$ is sampled uniformly from $f_A$, the Haar measure conditioned on orthonormality. Then $T$ satisfies the transformation condition.
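
To make the kind of transformation in the lemma concrete, the sketch below draws an orthonormal matrix by running Gram-Schmidt on Gaussian vectors, a standard way to sample from the Haar measure on the orthogonal group, and checks that it preserves the $\ell_2$ norm. The function names, the dimension `n = 16`, and the seed are illustrative assumptions, not taken from the paper. The paper's condition on $T$ is not reproduced here.

```python
import math
import random


def sample_orthonormal(n, rng):
    """Return an n x n matrix whose rows are orthonormal.

    Gram-Schmidt applied to i.i.d. Gaussian vectors yields a matrix
    distributed according to the Haar measure on the orthogonal group.
    """
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Two projection passes keep the result numerically orthogonal.
        for _ in range(2):
            for q in rows:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-12:  # reject the (measure-zero) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def apply_transform(T, x):
    """Matrix-vector product T x."""
    return [sum(t * xi for t, xi in zip(row, x)) for row in T]


def l2_norm(x):
    return math.sqrt(sum(a * a for a in x))


rng = random.Random(0)   # hypothetical seed, for reproducibility only
n = 16                   # small illustrative dimension
T = sample_orthonormal(n, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = apply_transform(T, x)

print("norm before:", l2_norm(x))
print("norm after: ", l2_norm(y))
```

Because the rows of `T` are orthonormal, a perturbation of a given $\ell_2$ norm keeps that norm after the transformation. That is the norm-preserving property the TL;DR attributes to the defense.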

Figures (10)

  • Figure 1: Illustration of the stealthy attack, which targets and flips latent dimensions with the smallest absolute values. This strategy preserves the multivariate Gaussian distribution of the latents while flipping more bits in the watermark codeword compared to the whitenoise attack.
  • Figure 2: (a) The proportion of bits flipped in the latent space, represented by a vector of $d=4\times64\times64=16,384$ elements sampled from $\mathcal{N}_d(0,1)$, for distortions up to an $\ell_2$ norm of 200. (b) Proportion of bits flipped, focusing on distortions up to an $\ell_2$ norm of 20. The horizontal dotted line on both plots indicates the distortion required by the three attack strategies to flip 5% of the bits.
  • Figure 3: Attack Success Rate (ASR) of the white-noise and stealthy attack across various distortion levels and message lengths.
  • Figure 4: (a) Attack Success Rate (ASR) of the White-Noise (W) and Stealthy (S) attackers across different message lengths. (b) ASR plot focusing on distortions up to an $\ell_2$ norm of 20, as distortions exceeding this magnitude are likely to cause significant alterations to the image.
  • Figure 5: (a) The original watermarked image. (b) Outcome of the stealthy attack with an $\ell_2$-norm distortion of $8$ on the starting point, resulting in approximately $12\%$ bit flip. (c) Outcome of the minimum distortion attack with an $\ell_2$-norm distortion of $8$ on the starting point, achieving approximately $16\%$ bit flip. The latent produced by the minimum distortion attack fails to produce a high-quality image under Stable Diffusion 2.1.
  • ...and 5 more figures
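
The comparison described in the captions of Figures 1 and 2 can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It assumes each codeword bit is the sign of one latent coordinate, that the stealthy attack negates the smallest-magnitude coordinates until an $\ell_2$ budget is spent, and that the white-noise attack adds Gaussian noise rescaled to the same budget. All function names, the budget of 8, and the seed are hypothetical.

```python
import math
import random


def stealthy_attack(z, budget):
    """Negate smallest-|z_i| coordinates while the l2 distortion fits the budget.

    Negating z_i costs |2 * z_i| in that coordinate and keeps the values
    distributed as a symmetric Gaussian.
    """
    order = sorted(range(len(z)), key=lambda i: abs(z[i]))
    out = list(z)
    used_sq = 0.0
    for i in order:
        cost_sq = (2.0 * z[i]) ** 2
        if used_sq + cost_sq > budget ** 2:
            break  # later coordinates are larger, so none of them fit either
        out[i] = -z[i]
        used_sq += cost_sq
    return out


def white_noise_attack(z, budget, rng):
    """Add Gaussian noise rescaled so its l2 norm equals the budget."""
    noise = [rng.gauss(0.0, 1.0) for _ in z]
    scale = budget / math.sqrt(sum(n * n for n in noise))
    return [a + scale * n for a, n in zip(z, noise)]


def flipped_fraction(z, z_adv):
    """Fraction of coordinates whose sign (assumed bit) changed."""
    return sum((a >= 0) != (b >= 0) for a, b in zip(z, z_adv)) / len(z)


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


rng = random.Random(0)            # hypothetical seed
d = 4 * 64 * 64                   # latent size from the Figure 2 caption
budget = 8.0                      # l2 distortion used in Figure 5
z = [rng.gauss(0.0, 1.0) for _ in range(d)]

z_stealthy = stealthy_attack(z, budget)
z_white = white_noise_attack(z, budget, rng)

frac_stealthy = flipped_fraction(z, z_stealthy)
frac_white = flipped_fraction(z, z_white)
print("stealthy:    distortion", l2_distance(z, z_stealthy), "flipped", frac_stealthy)
print("white noise: distortion", l2_distance(z, z_white), "flipped", frac_white)
```

Under these assumptions the stealthy attacker flips more signs than white noise at the same budget, because coordinates near zero sit close to the decision boundary and are cheap to push across it.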

Theorems & Definitions (3)

  • Lemma 1
  • Lemma 2
  • Theorem 1