Generating Image Adversarial Examples by Embedding Digital Watermarks
Yuexin Xiang, Tiantian Li, Wei Ren, Tianqing Zhu, Kim-Kwang Raymond Choo
TL;DR
This work introduces a watermark-based approach to generating image adversarial examples for deep neural networks: partial features of a watermark image are embedded into a host image using an improved DWT-based Patchwork scheme. The framework comprises three modules (Image Recognizer, Watermark Image Embedder, and Image Status Discriminator) that iteratively produce adversarial images, achieving an average attack success rate of 95.47% on CIFAR-10 at about 1.17 seconds per image. A baseline study with Gaussian-noise watermarks and a secondary DCT-based variant are also explored, with the primary DWT-based method delivering superior performance. The results demonstrate a fast way to craft adversarial examples through digital watermarking, with practical implications for evaluating DNN robustness and guiding defenses. The source code is available on GitHub.
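The interplay of the three modules can be pictured with a minimal sketch. Everything named here is an assumption for illustration and is not taken from the authors' code: the function `generate_adversarial`, the callables `recognize` and `embed`, and the idea of sweeping a list of embedding strengths are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image as nested lists (assumption)


def generate_adversarial(
    host: Image,
    true_label: int,
    watermarks: Sequence[Image],
    strengths: Sequence[float],
    recognize: Callable[[Image], int],
    embed: Callable[[Image, Image, float], Image],
) -> Optional[Tuple[Image, float]]:
    """Try watermark/strength pairs until the predicted label changes.

    Hypothetical sketch of the three-module loop; returns the first
    successful candidate and its strength, or None if nothing works.
    """
    # Image Recognizer: only attack hosts the model classifies correctly.
    if recognize(host) != true_label:
        return None
    for watermark in watermarks:
        for strength in strengths:
            # Watermark Image Embedder: produce a candidate image.
            candidate = embed(host, watermark, strength)
            # Image Status Discriminator: did the prediction flip?
            if recognize(candidate) != true_label:
                return candidate, strength
    return None
```

Passing the recognizer and embedder as callables keeps the loop independent of any particular DNN or watermarking algorithm, so the same skeleton could in principle wrap either a DWT-based or a DCT-based embedder.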
Abstract
As deep neural network (DNN) models attract increasing attention, attacks against them are also emerging. For example, an attacker may carefully construct images in specific ways (known as adversarial examples) to mislead DNN models into outputting incorrect classification results. Correspondingly, many efforts have been proposed to detect and mitigate adversarial examples, usually targeting specific attacks. In this paper, we propose a novel digital watermark-based method for generating image adversarial examples that fool DNN models. Specifically, part of the main features of the watermark image are embedded into the host image almost invisibly, with the aim of tampering with and impairing the recognition capabilities of the DNN models. We devise an efficient mechanism for selecting host images and watermark images, and use an improved discrete wavelet transform (DWT) based Patchwork watermarking algorithm with a set of valid hyperparameters to embed digital watermarks from the watermark image dataset into the original images, thereby generating image adversarial examples. The experimental results show that the attack success rate on common DNN models reaches an average of 95.47% on the CIFAR-10 dataset, with a maximum of 98.71%. In addition, our scheme can generate a large number of adversarial examples efficiently: on average, it takes 1.17 seconds to complete the attack on each image in the CIFAR-10 dataset. We also design a baseline experiment that uses watermark images generated from Gaussian noise as the watermark image dataset, which likewise demonstrates the effectiveness of our scheme. Finally, we propose a modified discrete cosine transform (DCT) based Patchwork watermarking algorithm. To ensure repeatability and reproducibility, the source code is available on GitHub.
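To make the idea of embedding watermark features in the wavelet domain concrete, the following is a simplified sketch. It is not the paper's improved Patchwork algorithm: it swaps in a plain additive blend of the watermark's low-frequency subband into the host's, using a single-level Haar transform written in pure Python. The function names and the `alpha` strength parameter are assumptions made for illustration.

```python
def haar_dwt2(img):
    """Single-level 2D Haar DWT of an even-sized grayscale image (nested lists).

    Returns the approximation subband followed by three detail subbands.
    """
    rows, cols = len(img), len(img[0])
    ll, d1, d2, d3 = ([[0.0] * (cols // 2) for _ in range(rows // 2)]
                      for _ in range(4))
    for i in range(0, rows, 2):
        for j in range(0, cols, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2  # local average (scaled)
            d1[i // 2][j // 2] = (a - b + c - d) / 2  # left/right difference
            d2[i // 2][j // 2] = (a + b - c - d) / 2  # top/bottom difference
            d3[i // 2][j // 2] = (a - b - c + d) / 2  # diagonal difference
    return ll, d1, d2, d3


def haar_idwt2(ll, d1, d2, d3):
    """Inverse of haar_dwt2: rebuild the image from its four subbands."""
    img = [[0.0] * (len(ll[0]) * 2) for _ in range(len(ll) * 2)]
    for i in range(len(ll)):
        for j in range(len(ll[0])):
            s, h, v, x = ll[i][j], d1[i][j], d2[i][j], d3[i][j]
            img[2 * i][2 * j] = (s + h + v + x) / 2
            img[2 * i][2 * j + 1] = (s - h + v - x) / 2
            img[2 * i + 1][2 * j] = (s + h - v - x) / 2
            img[2 * i + 1][2 * j + 1] = (s - h - v + x) / 2
    return img


def embed_watermark(host, watermark, alpha=0.1):
    """Additively blend the watermark's approximation subband into the host's.

    Simplified stand-in for a DWT-domain embedder; `alpha` (hypothetical)
    controls visibility. Host and watermark must have the same even size.
    """
    h_ll, h_d1, h_d2, h_d3 = haar_dwt2(host)
    w_ll, _, _, _ = haar_dwt2(watermark)
    mixed = [[hv + alpha * wv for hv, wv in zip(hr, wr)]
             for hr, wr in zip(h_ll, w_ll)]
    out = haar_idwt2(mixed, h_d1, h_d2, h_d3)
    # Clip to the valid 8-bit pixel range.
    return [[min(255.0, max(0.0, v)) for v in row] for row in out]
```

A small `alpha` keeps the watermark nearly invisible, while a larger one perturbs the image more strongly. A library such as PyWavelets would normally replace the hand-written transform.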
