Warfare: Breaking the Watermark Protection of AI-Generated Content

Guanlin Li, Yifei Chen, Jie Zhang, Shangwei Guo, Han Qiu, Guoyin Wang, Jiwei Li, Tianwei Zhang

TL;DR

Warfare reveals that current AIGC watermarking can be defeated under a black-box threat model by a unified attack framework that couples diffusion-based preprocessing with GAN-driven watermark manipulation to remove or forge watermarks. The proposed Warfare and its faster variant Warfare-Plus demonstrate strong attack effectiveness across multiple datasets and watermark lengths while maintaining content quality, highlighting serious risks to content attribution and policy enforcement. The work also characterizes practical defenses and positions Warfare as a tool for red-teaming and robust watermark design, emphasizing the need for more resilient watermarking in real-world AIGC deployments.

Abstract

AI-Generated Content (AIGC) is rapidly expanding, with services using advanced generative models to create realistic images and fluent text. Regulating such content is crucial to prevent policy violations, such as unauthorized commercialization or unsafe content distribution. Watermarking is a promising solution for content attribution and verification, but we demonstrate its vulnerability to two key attacks: (1) watermark removal, where adversaries erase embedded marks to evade regulation, and (2) watermark forging, where they generate illicit content with forged watermarks, leading to misattribution. We propose Warfare, a unified attack framework that leverages a pre-trained diffusion model for content processing and a generative adversarial network for watermark manipulation. Evaluations across datasets and embedding setups show that Warfare achieves high success rates while preserving content quality. We further introduce Warfare-Plus, which enhances efficiency without compromising effectiveness. The code is available at https://github.com/GuanlinLee/warfare.

Paper Structure (31 sections, 2 equations, 14 figures, 12 tables)

This paper contains 31 sections, 2 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Overview of Warfare. (1) Collecting watermarked data from the target AIGC service or the Internet. (2) Using a public pre-trained denoising model to purify the watermarked data. (3) Using the watermarked and mediator data to train a GAN, which can then remove or forge the watermark. $x'$ is the watermarked image. $\hat{x}$ is the mediator image. The subscript $i$ is omitted.
  • Figure 2: Bit Acc for different tasks during the training stage on CelebA.
  • Figure 3: The first column shows clean images, the second watermarked images, the third the output of $\mathrm{DM}_l$, and the fourth the output of $\mathrm{DM}_s$.
  • Figure 4: Bit Acc as the number of training epochs increases.
  • Figure 5: Clean images and corresponding outputs from Warfare. The top two rows are clean images.
  • ...and 9 more figures
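
Several captions above report "Bit Acc" without defining it in this excerpt. A minimal sketch follows, assuming it means the usual bit accuracy: the fraction of watermark bits that a decoder recovers correctly. The function name `bit_accuracy` and the example bit strings are hypothetical and are not taken from the paper or its repository.

```python
from typing import Sequence


def bit_accuracy(extracted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where the extracted bits equal the reference bits."""
    if len(extracted) != len(reference):
        raise ValueError("bit strings must have the same length")
    if not reference:
        raise ValueError("bit strings must be non-empty")
    matches = sum(int(e == r) for e, r in zip(extracted, reference))
    return matches / len(reference)


# Hypothetical 8-bit watermark and a hypothetical decoder output after an attack.
embedded = [1, 0, 1, 1, 0, 0, 1, 0]
after_removal = [1, 1, 0, 1, 0, 1, 0, 0]

print(bit_accuracy(embedded, embedded))       # 1.0: watermark fully recovered
print(bit_accuracy(after_removal, embedded))  # 0.5: no better than random guessing
```

Under this reading, a removal attack aims to push the value toward chance level (about 0.5 for random bits), while a forging attack aims to push it toward 1.0 with respect to the target watermark.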