A Transfer Attack to Image Watermarks

Yuepeng Hu; Zhengyuan Jiang; Moyang Guo; Neil Zhenqiang Gong

TL;DR

This work shows that watermark-based AI-generated image detectors are not robust in the no-box setting. By training dozens of diverse surrogate watermarking models and solving a joint optimization, the authors craft a single perturbation that causes all surrogates to decode targeted watermarks, thereby evading the target detector with minimal perceptual impact. Theoretical analysis provides bounds on transferability, and extensive experiments across Stable Diffusion, Midjourney, and DALL-E 2 datasets demonstrate substantial evasion gains over existing attacks and post-processing baselines, including against certifiably robust watermarks. The findings highlight a need for stronger watermark designs and defensive strategies to protect authenticity in AI-generated imagery.
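To make "evading the target detector" concrete, the following is a minimal sketch of how a watermark-based detector could decide whether an image is AI-generated. It assumes the detector compares the decoded bitstring against a reference watermark and flags the image when the fraction of matching bits reaches a threshold. The function names, the threshold `tau`, and the single-threshold rule are assumptions for illustration, not details taken from this page.

```python
def bitwise_accuracy(decoded, reference):
    """Fraction of positions where the decoded bits equal the reference watermark."""
    if len(decoded) != len(reference):
        raise ValueError("watermarks must have the same length")
    matches = sum(1 for d, r in zip(decoded, reference) if d == r)
    return matches / len(reference)


def is_flagged(decoded, reference, tau=0.8):
    """Hypothetical detection rule: flag as AI-generated when accuracy >= tau."""
    return bitwise_accuracy(decoded, reference) >= tau
```

Under this rule, an evasion attack succeeds when the perturbation flips enough decoded bits to push the bitwise accuracy below `tau`.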

Abstract

Watermarks have been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detectors against evasion attacks in the white-box and black-box settings is well understood in the literature. However, their robustness in the no-box setting is much less understood. In this work, we propose a new transfer evasion attack on image watermarks in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show, both theoretically and empirically, that watermark-based AI-generated image detectors built on existing watermarking methods are not robust to evasion attacks even if the attacker has access to neither the watermarking model nor the detection API. Our code is available at: https://github.com/hifi-hyp/Watermark-Transfer-Attack.
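The core idea, a single bounded perturbation that makes every surrogate decoder output an attacker-chosen watermark, can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes linear surrogate decoders, an $\ell_\infty$ budget, a hinge-style loss, and sign-gradient updates with projection; the names `decode` and `transfer_perturbation`, and the choice of target watermarks as the bitwise complement of each surrogate's current output, are all assumptions made for illustration.

```python
import random


def decode(W, x):
    """Toy linear decoder: bit j is 1 when row j of W has a positive dot product with x."""
    return [1 if sum(w * v for w, v in zip(row, x)) > 0 else 0 for row in W]


def transfer_perturbation(x_w, surrogates, targets, budget, step=0.01, iters=500):
    """Search for one perturbation that makes every surrogate decode its target bits.

    Sign-gradient ascent on a hinge objective, clipped to the l_inf ball of
    radius `budget`. Stops early once all surrogates output their targets.
    Pixel-range clipping is omitted to keep the sketch short.
    """
    delta = [0.0] * len(x_w)
    for _ in range(iters):
        x = [a + b for a, b in zip(x_w, delta)]
        grad = [0.0] * len(x_w)
        done = True
        for W, target in zip(surrogates, targets):
            for row, bit in zip(W, target):
                sign = 1.0 if bit == 1 else -1.0
                logit = sum(w * v for w, v in zip(row, x))
                if sign * logit <= 0.0:  # this bit is not yet safely at its target
                    done = False
                    for k, w in enumerate(row):
                        grad[k] += sign * w
        if done:
            break
        for k, g in enumerate(grad):
            move = step if g > 0 else -step if g < 0 else 0.0
            delta[k] = max(-budget, min(budget, delta[k] + move))
    return delta


# Small demo with random surrogates (all sizes are arbitrary choices).
random.seed(0)
dim, n_bits, n_surrogates = 64, 8, 5
x_w = [random.uniform(0.0, 1.0) for _ in range(dim)]
surrogates = [[[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
              for _ in range(n_surrogates)]
# Assumed target choice: flip every bit each surrogate currently decodes.
targets = [[1 - b for b in decode(W, x_w)] for W in surrogates]
delta = transfer_perturbation(x_w, surrogates, targets, budget=0.25)
x_adv = [a + b for a, b in zip(x_w, delta)]
hit = sum(decode(W, x_adv) == t for W, t in zip(surrogates, targets))
print(f"surrogates fooled: {hit}/{n_surrogates}, max |delta| = {max(abs(d) for d in delta):.3f}")
```

In the no-box setting the attacker never queries the target model, so the hope is that a perturbation fooling many diverse surrogates also transfers to the unseen target decoder. The sketch only covers the surrogate side of that argument.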

Paper Structure (35 sections, 5 theorems, 38 equations, 16 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Image Watermarks
Evasion Attacks
Problem Formulation
Watermark-Based Detection
Threat Model
Our Transfer Attack
Overview
Train Surrogate Watermarking Models
Formulate an Optimization Problem
Solve the Optimization Problem
Theoretical Analysis
Experiments
Experimental Setup
...and 20 more sections

Key Result

Theorem 1

For any watermarked image $x_w$, perturbation $\delta$ satisfying the constraints in Equation optim-3, and watermarks decoded by $m$ surrogate decoders for the perturbed image, the probability that the $j$th bit of the watermark decoded by $T$ matches the ground-truth watermark $w$ is bounded as fol

Figures (16)

Figure 1: Watermarked images generated by Stable Diffusion (first row) and their perturbed versions in our transfer attack that successfully evade detection (second row). The target watermarking model uses ResNet architecture. Our transfer attack uses 100 surrogate watermarking models, each of which uses CNN architecture.
Figure 2: Evasion rate of our transfer attack when the target model uses CNN (first row) or ResNet (second row) architecture and different watermark lengths (the three columns). The surrogate models use CNN architecture and watermark length of 30 bits.
Figure 3: Average perturbation of our transfer attack when the target model uses CNN (first row) or ResNet (second row) architecture and different watermark lengths (the three columns). The surrogate models use CNN architecture and watermark length of 30 bits.
Figure 4: Comparing the average perturbations of our transfer attack and common post-processing methods when they achieve the same evasion rate. The target model uses ResNet architecture and different watermark lengths (the three columns). Dataset is Stable Diffusion. Results for Midjourney are shown in the Appendix.
Figure 5: (a) Comparing evasion rates of existing and our transfer attacks. The target model is ResNet and uses watermarks with different lengths. Dataset is Stable Diffusion. Similar results for Midjourney are shown in the Appendix. (b) Evasion rates and (c) average $\ell_{\infty}$-norm perturbation of our transfer attacks to different target watermarking methods.
...and 11 more figures

Theorems & Definitions (10)

Theorem 1
Definition 1: Unperturbed similarity
Definition 2: Positive transferring similarity
Definition 3: Negative transferring similarity
Definition 4: $q$-attacking strength
Definition 5: $\beta$-accurate watermarking
Lemma 1
Theorem 2
Lemma 2
Theorem 3
