A Transfer Attack to Image Watermarks
Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Zhenqiang Gong
TL;DR
This work shows that watermark-based AI-generated image detectors are not robust in the no-box setting. By training dozens of diverse surrogate watermarking models and solving a joint optimization, the authors craft a single perturbation that causes all surrogates to decode targeted watermarks, thereby evading the target detector with minimal perceptual impact. Theoretical analysis provides bounds on transferability, and extensive experiments across Stable Diffusion, Midjourney, and DALL-E 2 datasets demonstrate substantial evasion gains over existing attacks and post-processing baselines, including against certifiably robust watermarks. The findings highlight a need for stronger watermark designs and defensive strategies to protect authenticity in AI-generated imagery.
Abstract
Watermarking has been widely deployed by industry to detect AI-generated images. The robustness of such watermark-based detectors against evasion attacks in the white-box and black-box settings is well understood in the literature. However, their robustness in the no-box setting is much less understood. In this work, we propose a new transfer evasion attack on image watermarks in the no-box setting. Our transfer attack adds a perturbation to a watermarked image to evade multiple surrogate watermarking models trained by the attacker itself, and the perturbed watermarked image also evades the target watermarking model. Our major contribution is to show, both theoretically and empirically, that watermark-based AI-generated image detectors built on existing watermarking methods are not robust to evasion attacks even if the attacker has access to neither the watermarking model nor the detection API. Our code is available at: https://github.com/hifi-hyp/Watermark-Transfer-Attack.
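
The core idea of finding one perturbation that makes several surrogate decoders output attacker-chosen watermarks can be illustrated with a toy sketch. This is not the authors' implementation (see the repository above for that). It assumes each surrogate decoder is a simple linear map followed by thresholding, that the target watermark for each surrogate is the bitwise flip of what it currently decodes, and that plain gradient descent with early stopping stands in for the joint optimization. The names `decode` and `craft_transfer_perturbation` are hypothetical.

```python
import numpy as np


def decode(W, x):
    """Toy surrogate decoder (assumption): bit i is 1 iff the i-th logit is positive."""
    return (W @ x > 0).astype(int)


def craft_transfer_perturbation(x_w, decoders, targets, lr=1.0, max_steps=2000):
    """Find a single perturbation delta such that every surrogate decodes its target bits.

    x_w      : flattened watermarked image, shape (dim,)
    decoders : list of (n_bits, dim) matrices, one per surrogate (toy linear decoders)
    targets  : list of target bit vectors in {0, 1}, one per surrogate
    """
    delta = np.zeros_like(x_w, dtype=float)
    for _ in range(max_steps):
        x_adv = x_w + delta
        # Stop as soon as all surrogates decode their targets; this keeps delta small.
        if all(np.array_equal(decode(W, x_adv), t) for W, t in zip(decoders, targets)):
            break
        grad = np.zeros_like(delta)
        for W, t in zip(decoders, targets):
            probs = 1.0 / (1.0 + np.exp(-(W @ x_adv)))
            # Gradient of the summed binary cross-entropy with respect to the input.
            grad += W.T @ (probs - t)
        # Average over surrogates so the step size does not depend on their number.
        delta -= lr * grad / len(decoders)
    return delta
```

A caller would build `targets` as `[1 - decode(W, x_w) for W in decoders]` and then add the returned `delta` to `x_w`. Whether such a perturbation also fools an unseen target decoder depends on how closely the surrogates resemble it, which this toy setup does not model.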
