Evading Watermark based Detection of AI-Generated Content

Zhengyuan Jiang; Jinghuai Zhang; Neil Zhenqiang Gong

TL;DR

This work scrutinizes watermark-based detectors for AI-generated content and shows they can be defeated by carefully crafted, small-perturbation post-processing. It introduces WEvade, a family of white-box and black-box attacks that minimally perturb watermarked images to evade detection, and provides rigorous theoretical evasion-rate analyses alongside extensive empirical evaluation across HiDDeN, UDH, and Stable Diffusion. The results demonstrate that the proposed attacks achieve high evasion rates with perturbations far smaller than traditional post-processing methods, even in the presence of adversarial training, highlighting the fragility of current watermark-based defenses. The paper argues for watermarking methods with provable robustness guarantees and discusses extensions to text watermarking and real-world deployment scenarios as future directions.

Abstract

A generative AI model can generate extremely realistic-looking content, posing growing challenges to the authenticity of information. To address these challenges, watermarking has been leveraged to detect AI-generated content. Specifically, a watermark is embedded into AI-generated content before it is released, and content is detected as AI-generated if a similar watermark can be decoded from it. In this work, we perform a systematic study of the robustness of such watermark-based detection of AI-generated content, focusing on AI-generated images. Our work shows that an attacker can post-process a watermarked image by adding a small, human-imperceptible perturbation to it, such that the post-processed image evades detection while maintaining its visual quality. We show the effectiveness of our attack both theoretically and empirically. Moreover, to evade detection, our adversarial post-processing method adds much smaller perturbations to AI-generated images, and thus better maintains their visual quality, than existing popular post-processing methods such as JPEG compression, Gaussian blur, and Brightness/Contrast. Our work shows the insufficiency of existing watermark-based detection of AI-generated content, highlighting the urgent need for new methods. Our code is publicly available: https://github.com/zhengyuan-jiang/WEvade.

Paper Structure (27 sections, 4 theorems, 33 equations, 29 figures, 3 algorithms)

Introduction
Related Work
Detecting AI-generated Content
Watermarking Methods
Watermark-based Detectors
Threat Model
Our WEvade
White-box Setting
Extending Standard Adversarial Examples to Watermarking (WEvade-W-I)
Formulating a New Optimization Problem (WEvade-W-II)
Solving the Optimization Problems
Black-box Setting
Theoretical Analysis
White-box Setting
Black-box Setting
...and 12 more sections

Key Result

Theorem 1

Let $I_w$ be a watermarked image that is detected by a single-tail or double-tail detector with threshold $\tau>0.5$, and suppose $I_{pw}$ is found by our WEvade-W-I. Then $I_{pw}$ is guaranteed to evade the single-tail detector, but is guaranteed to be detected by the double-tail detector. Formally, we have $BA(D(I_{pw}), w) \le 1-\tau < \tau$, where $w$ is any unknown ground-truth watermark.
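
The intuition behind Theorem 1 can be illustrated with a toy sketch. It assumes (this is not stated on this page) that WEvade-W-I drives the decoded watermark to the bitwise complement of the watermark decoded from $I_w$, that the single-tail detector flags an image when the bitwise accuracy $BA \ge \tau$, and that the double-tail detector flags it when $BA \ge \tau$ or $BA \le 1-\tau$. The helper names and the 8-bit watermark below are hypothetical; no real encoder or decoder is involved.

```python
from typing import List, Sequence


def bitwise_accuracy(decoded: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of bit positions where the two watermarks agree."""
    if len(decoded) != len(reference):
        raise ValueError("watermarks must have the same length")
    matches = sum(d == r for d, r in zip(decoded, reference))
    return matches / len(reference)


def single_tail_detects(ba: float, tau: float) -> bool:
    # Assumed rule: flag the image only when accuracy is high.
    return ba >= tau


def double_tail_detects(ba: float, tau: float) -> bool:
    # Assumed rule: flag the image when accuracy is very high OR very low.
    return ba >= tau or ba <= 1.0 - tau


def flip_bits(bits: Sequence[int]) -> List[int]:
    """Bitwise complement, standing in for the effect assumed for WEvade-W-I."""
    return [1 - b for b in bits]


# Toy 8-bit ground-truth watermark w (hypothetical values).
w = [1, 0, 1, 1, 0, 0, 1, 0]
# Watermark decoded from the watermarked image: one bit is wrong.
decoded_watermarked = [1, 0, 1, 1, 0, 0, 1, 1]
# Watermark decoded after the assumed attack: every decoded bit is flipped.
decoded_attacked = flip_bits(decoded_watermarked)

tau = 0.75
ba_before = bitwise_accuracy(decoded_watermarked, w)
ba_after = bitwise_accuracy(decoded_attacked, w)

print("BA before attack:", ba_before)
print("BA after attack: ", ba_after)
print("single-tail detects after attack:", single_tail_detects(ba_after, tau))
print("double-tail detects after attack:", double_tail_detects(ba_after, tau))
```

Under these assumptions, flipping every decoded bit turns a bitwise accuracy of $BA$ into $1 - BA$, so any image that was detected with $BA \ge \tau$ lands at or below $1-\tau$: below the single-tail threshold, but inside the lower tail of the double-tail detector.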

Figures (29)

Figure 1: Illustration of original image, watermarked image, and watermarked images post-processed by existing and our methods (last two columns) to evade detection. The watermarking method is HiDDeN. GN: Gaussian noise. GB: Gaussian blur. B/C: Brightness/Contrast. The encoder/decoder are trained via standard training (first row) or adversarial training (second row).
Figure 2: Illustration of training encoder and decoder in learning-based watermarking methods.
Figure 3: Illustration of (a) single-tail detector and (b) double-tail detector with threshold $\tau$. The bitwise accuracy of an original image $I_o$ follows a binomial distribution divided by $n$, i.e., $BA(D(I_o), w) \sim B(n,0.5)/n$. The area of the shaded region(s) is the false positive rate (FPR) of a detector.
Figure 4: False positive rate (FPR) and false negative rate (FNR) of the double-tail detector based on UDH as the threshold $\tau$ varies when there are no attacks.
Figure 5: Average bitwise accuracy and average perturbation of the post-processed watermarked images when Gaussian blur uses different standard deviations.
...and 24 more figures
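
The false positive rates described in the Figure 3 caption can be computed exactly from the stated distribution $BA(D(I_o), w) \sim B(n,0.5)/n$. The sketch below is a minimal illustration, assuming the single-tail detector flags an image when $BA \ge \tau$ and the double-tail detector flags it when $BA \ge \tau$ or $BA \le 1-\tau$; the function names are hypothetical and the small tolerance in the rounding step is an implementation choice to absorb floating-point error in $\tau n$.

```python
import math
from typing import Tuple


def upper_tail_count(n: int, k_min: int) -> int:
    """Number of n-bit strings that match w in at least k_min positions."""
    return sum(math.comb(n, k) for k in range(k_min, n + 1))


def false_positive_rates(n: int, tau: float) -> Tuple[float, float]:
    """Return (single-tail FPR, double-tail FPR) for an n-bit watermark.

    Assumes each decoded bit of a non-watermarked image matches w
    independently with probability 0.5, as in the Figure 3 caption.
    """
    if not 0.5 < tau <= 1.0:
        raise ValueError("tau must lie in (0.5, 1]")
    # Smallest number of matching bits with BA >= tau.
    k_min = math.ceil(tau * n - 1e-9)
    upper = upper_tail_count(n, k_min)
    # BA <= 1 - tau is equivalent to at most n - k_min matching bits.
    lower = sum(math.comb(n, k) for k in range(0, n - k_min + 1))
    total = 2 ** n
    return upper / total, (upper + lower) / total


single_fpr, double_fpr = false_positive_rates(4, 0.75)
print("single-tail FPR:", single_fpr)  # 5/16
print("double-tail FPR:", double_fpr)  # 10/16
```

Because $B(n, 0.5)$ is symmetric, the two shaded tails have equal area, so under these assumptions the double-tail FPR is twice the single-tail FPR at the same $\tau$.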

Theorems & Definitions (5)

Theorem 1
Theorem 2
Theorem 3
Definition 1: $(\beta,\gamma)$-similar
Theorem 4
