Removal Attack and Defense on AI-generated Content Latent-based Watermarking

De Zhang Lee; Han Fang; Hanyi Wang; Ee-Chien Chang


TL;DR

This work shows that latent-space watermarks embedded in latent diffusion models (LDMs) can be vulnerable to removal attacks that exploit boundary leakage, even when the watermark is indistinguishable from unwatermarked content. It develops a stealthy removal strategy that uses the leaked boundary information to remove watermarks with far smaller perturbations than a white-noise attack. It also proposes a boundary-hiding defense that combines a secret, norm-preserving transformation with a well-behaved detector. The authors prove that, under appropriate conditions, the defense neutralizes the attacker's advantage, rendering any perturbation equivalent to white noise, and they validate the approach with extensive experiments on Stable Diffusion variants. The study underscores the importance of concealing boundary information in latent-based watermarking to ensure robustness against removal while preserving image fidelity. Overall, the work provides both a practical defense and a rigorous security framing for latent-space watermarking in AIGC.
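To make "boundary leakage" concrete, here is a minimal numerical sketch under deliberately simplified assumptions; it is not the authors' attack. It assumes a toy sign-based detector: each coordinate of a hypothetical 1024-dimensional latent carries one key bit in its sign, and the watermark counts as detected while the sign agreement with the key stays at or above a hypothetical threshold of 0.6. The decision boundaries of such a detector are the hyperplanes z_i = 0, so the latent itself reveals which coordinates are cheapest to push across, and the attacker needs no key. The helper names `agreement`, `boundary_attack`, and `white_noise_attack` are illustrative only.

```python
import numpy as np

def agreement(latent, key):
    """Toy detector score: fraction of coordinates whose sign matches the key bit."""
    return float(np.mean(np.sign(latent) == key))

def boundary_attack(latent, tau, eps=1e-3):
    """Flip the signs of the coordinates closest to the boundary z_i = 0 (key not needed)."""
    d = latent.size
    m = int(np.floor((1.0 - tau) * d)) + 1        # flips needed to drop agreement below tau
    idx = np.argsort(np.abs(latent))[:m]            # m smallest-magnitude coordinates
    delta = np.zeros_like(latent)
    delta[idx] = -latent[idx] - eps * np.sign(latent[idx])  # land just past zero
    return latent + delta, float(np.linalg.norm(delta))

def white_noise_attack(latent, key, tau, rng, sigmas):
    """Norm of the first isotropic Gaussian perturbation (over a sigma grid) that drops agreement below tau."""
    for sigma in sigmas:
        noise = sigma * rng.standard_normal(latent.shape)
        if agreement(latent + noise, key) < tau:
            return float(np.linalg.norm(noise))
    return float("inf")

rng = np.random.default_rng(0)
d, tau = 4 * 16 * 16, 0.6                        # hypothetical latent size and threshold
key = rng.choice([-1.0, 1.0], size=d)            # stand-in for a pseudorandom codeword
latent = key * np.abs(rng.standard_normal(d))    # watermarked latent: signs carry the key

attacked, boundary_norm = boundary_attack(latent, tau)
noise_norm = white_noise_attack(latent, key, tau, rng, np.linspace(0.1, 10.0, 200))
print(f"latent norm      {np.linalg.norm(latent):.1f}")
print(f"boundary attack  {boundary_norm:.1f}  agreement {agreement(attacked, key):.3f}")
print(f"white noise      {noise_norm:.1f}")
```

In this toy setting the boundary-aware perturbation is roughly an order of magnitude smaller in L2 norm than the white noise needed for the same effect; the ratio depends on the assumed dimension and threshold and is not the paper's reported figure.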

Abstract

Digital watermarks can be embedded into AI-generated content (AIGC) by initializing the generation process with starting points sampled from a secret distribution. When combined with pseudorandom error-correcting codes, such watermarked outputs can remain indistinguishable from unwatermarked objects while maintaining robustness under white noise. In this paper, we go beyond indistinguishability and investigate security under removal attacks. We demonstrate that indistinguishability alone does not necessarily guarantee resistance to adversarial removal. Specifically, we propose a novel attack that exploits boundary information leaked by the locations of watermarked objects. Under certain settings, this attack reduces the distortion required to remove watermarks by up to $15\times$ compared with a baseline white-noise attack. To mitigate such attacks, we introduce a defense mechanism that applies a secret transformation to hide the boundary, and we prove that this transformation effectively renders any attacker's perturbations equivalent to those of a naive white-noise adversary. Our empirical evaluations on multiple versions of Stable Diffusion validate the effectiveness of both the attack and the proposed defense, highlighting the importance of addressing boundary leakage in latent-based watermarking schemes.
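The defense idea of hiding the boundary behind a secret, norm-preserving transformation can be illustrated in a toy model; this is a sketch under stated assumptions, not the authors' construction. It assumes the key bits are carried by the signs of a vector u, and the published latent is z = Q^T u for a secret, Haar-random orthogonal matrix Q, with the detector checking the signs of Q z. Because Q is orthogonal, z keeps the N(0, I) distribution and perturbation norms are unchanged, but the coordinate hyperplanes visible to the attacker are no longer the detector's boundaries. The dimension, threshold, and helper names (`random_orthogonal`, `agreement`, `boundary_attack`) are hypothetical.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix (column signs corrected)."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))

def agreement(latent, key, secret=None):
    """Toy detector score: sign agreement with the key, measured in the secret basis if given."""
    coords = latent if secret is None else secret @ latent
    return float(np.mean(np.sign(coords) == key))

def boundary_attack(latent, m, eps=1e-3):
    """Push the m smallest-magnitude *observed* coordinates just across zero."""
    idx = np.argsort(np.abs(latent))[:m]
    delta = np.zeros_like(latent)
    delta[idx] = -latent[idx] - eps * np.sign(latent[idx])
    return latent + delta, delta

rng = np.random.default_rng(1)
d, tau = 1024, 0.6                             # hypothetical latent size and threshold
m = int(np.floor((1.0 - tau) * d)) + 1         # flips that would defeat an undefended detector
key = rng.choice([-1.0, 1.0], size=d)
u = key * np.abs(rng.standard_normal(d))       # key carried by signs in the secret basis
Q = random_orthogonal(d, rng)                  # secret, norm-preserving transformation
z_plain = u                                    # undefended: the observed latent is u itself
z_hidden = Q.T @ u                             # defended: observed latent Q^T u, still N(0, I)

plain_attacked, delta_plain = boundary_attack(z_plain, m)
hidden_attacked, delta_hidden = boundary_attack(z_hidden, m)
noise = rng.standard_normal(d)
noise *= np.linalg.norm(delta_hidden) / np.linalg.norm(noise)  # white noise of equal norm

print(f"undefended, boundary attack : {agreement(plain_attacked, key):.3f}")
print(f"defended,   boundary attack : {agreement(hidden_attacked, key, Q):.3f}")
print(f"defended,   white noise     : {agreement(z_hidden + noise, key, Q):.3f}")
```

With the same number of flipped coordinates, the boundary attack removes the toy watermark when the boundary is exposed, but under the secret rotation it lowers the agreement only about as much as white noise of the same norm, which mirrors, in toy form, the claim that the defense reduces any attacker to a white-noise adversary.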

