Evading Watermark based Detection of AI-Generated Content

Zhengyuan Jiang, Jinghuai Zhang, Neil Zhenqiang Gong

TL;DR

This work scrutinizes watermark-based detectors for AI-generated content and shows they can be defeated by carefully crafted, small-perturbation post-processing. It introduces WEvade, a family of white-box and black-box attacks that minimally perturb watermarked images to evade detection, and provides rigorous theoretical evasion-rate analyses alongside extensive empirical evaluation across HiDDeN, UDH, and Stable Diffusion. The results demonstrate that the proposed attacks achieve high evasion rates with perturbations far smaller than traditional post-processing methods, even in the presence of adversarial training, highlighting the fragility of current watermark-based defenses. The paper argues for watermarking methods with provable robustness guarantees and discusses extensions to text watermarking and real-world deployment scenarios as future directions.

Abstract

A generative AI model can generate extremely realistic-looking content, posing growing challenges to the authenticity of information. To address these challenges, watermarks have been leveraged to detect AI-generated content. Specifically, a watermark is embedded into AI-generated content before it is released, and content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study of the robustness of such watermark-based detection of AI-generated content, focusing on AI-generated images. Our work shows that an attacker can post-process a watermarked image by adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images, and thus better maintains their visual quality, than existing popular post-processing methods such as JPEG compression, Gaussian blur, and Brightness/Contrast. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent need for new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.

Paper Structure

This paper contains 27 sections, 4 theorems, 33 equations, 29 figures, 3 algorithms.

Key Result

Theorem 1

Let $I_w$ be a watermarked image that is detected by a single-tail or double-tail detector with a threshold $\tau>0.5$, and suppose $I_{pw}$ is found by our WEvade-W-I. Then $I_{pw}$ is guaranteed to evade the single-tail detector, but is guaranteed to be detected by the double-tail detector. This holds for any unknown ground-truth watermark $w$.
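
The asymmetry in Theorem 1 can be illustrated with a small sketch that works on bit lists rather than on images. It rests on assumptions that the theorem text here does not spell out: that the single-tail detector flags an image when its bitwise accuracy exceeds $\tau$, that the double-tail detector flags it when the bitwise accuracy exceeds $\tau$ or falls below $1-\tau$, and that the post-processed image decodes to the bitwise complement of the watermark decoded from $I_w$. The function names and the 20-bit watermark are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def bitwise_accuracy(decoded: Sequence[int], watermark: Sequence[int]) -> float:
    """Fraction of positions where the decoded bits match the watermark."""
    if len(decoded) != len(watermark):
        raise ValueError("decoded bits and watermark must have the same length")
    matches = sum(d == w for d, w in zip(decoded, watermark))
    return matches / len(watermark)


def single_tail_detect(ba: float, tau: float) -> bool:
    # Assumed rule: flag only when the bitwise accuracy is high.
    return ba > tau


def double_tail_detect(ba: float, tau: float) -> bool:
    # Assumed rule: flag when the bitwise accuracy is unusually high OR low.
    return ba > tau or ba < 1 - tau


# Hypothetical 20-bit ground-truth watermark w.
w = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
tau = 0.8

# Bits decoded from the watermarked image: one decoding error.
decoded_w = list(w)
decoded_w[3] ^= 1

# Assumed effect of the attack: every decoded bit is flipped.
decoded_pw = [1 - b for b in decoded_w]

ba_w = bitwise_accuracy(decoded_w, w)
ba_pw = bitwise_accuracy(decoded_pw, w)

print("watermarked:    BA =", ba_w, single_tail_detect(ba_w, tau), double_tail_detect(ba_w, tau))
print("post-processed: BA =", ba_pw, single_tail_detect(ba_pw, tau), double_tail_detect(ba_pw, tau))
```

Under these assumptions, flipping every bit maps a bitwise accuracy $a$ to $1-a$, so an accuracy above $\tau$ lands below $1-\tau$. That falls outside the single-tail detection region but inside the lower tail of the double-tail detector.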

Figures (29)

  • Figure 1: Illustration of original image, watermarked image, and watermarked images post-processed by existing and our methods (last two columns) to evade detection. The watermarking method is HiDDeN. GN: Gaussian noise. GB: Gaussian blur. B/C: Brightness/Contrast. The encoder/decoder are trained via standard training (first row) or adversarial training (second row).
  • Figure 2: Illustration of training encoder and decoder in learning-based watermarking methods.
  • Figure 3: Illustration of (a) single-tail detector and (b) double-tail detector with threshold $\tau$. The bitwise accuracy of an original image $I_o$ follows a binomial distribution divided by $n$, i.e., $BA(D(I_o), w) \sim B(n,0.5)/n$. The area of the shaded region(s) is the false positive rate (FPR) of a detector.
  • Figure 4: False positive rate (FPR) and false negative rate (FNR) of the double-tail detector based on UDH as the threshold $\tau$ varies when there are no attacks.
  • Figure 5: Average bitwise accuracy and average perturbation of the post-processed watermarked images when Gaussian blur uses different standard deviations.
  • ...and 24 more figures
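
The Figure 3 caption states that the bitwise accuracy of an original image follows $B(n,0.5)/n$ and that the shaded area is the detector's FPR. A minimal sketch of that calculation follows. It assumes the single-tail detector flags an image when the bitwise accuracy exceeds $\tau$ and the double-tail detector flags it when the accuracy exceeds $\tau$ or falls below $1-\tau$. The function names and the example values of $n$ and $\tau$ are hypothetical.

```python
from fractions import Fraction
from math import comb


def _exact(tau: float) -> Fraction:
    # Convert e.g. 0.8 to 4/5 so threshold comparisons are exact, not float-rounded.
    return Fraction(tau).limit_denominator(10**6)


def single_tail_fpr(n: int, tau: float) -> float:
    """P(BA > tau) when the number of matching bits is Binomial(n, 0.5)."""
    t = _exact(tau)
    hits = sum(comb(n, k) for k in range(n + 1) if Fraction(k, n) > t)
    return hits / 2**n


def double_tail_fpr(n: int, tau: float) -> float:
    """P(BA > tau or BA < 1 - tau) under the same Binomial(n, 0.5) model."""
    t = _exact(tau)
    hits = sum(
        comb(n, k)
        for k in range(n + 1)
        if Fraction(k, n) > t or Fraction(k, n) < 1 - t
    )
    return hits / 2**n


if __name__ == "__main__":
    n = 30  # hypothetical watermark length in bits
    for tau in (0.7, 0.8, 0.9):
        print(f"tau={tau}: single-tail FPR={single_tail_fpr(n, tau):.3e}, "
              f"double-tail FPR={double_tail_fpr(n, tau):.3e}")
```

Because $B(n,0.5)$ is symmetric, the double-tail FPR under these assumptions is exactly twice the single-tail FPR for any $\tau>0.5$. The double-tail detector therefore needs a slightly larger $\tau$ to reach the same FPR.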

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Definition 1: $(\beta,\gamma)$-similar
  • Theorem 4