Leveraging Optimization for Adaptive Attacks on Image Watermarks

Nils Lukas, Abdulrahman Diaa, Lucas Fenaux, Florian Kerschbaum

TL;DR

The paper addresses the robustness of no-box watermarking for image generators by formulating robustness as an optimization problem and introducing adaptive, learnable attacks that rely on differentiable surrogate keys. It proposes differentiable key generation (GKeyGen) and two attacks—Adversarial Noising and Adversarial Compression—to efficiently optimize attack parameters. Experiments on Stable Diffusion demonstrate that five watermarking methods (TRW, WDM, DWT, DWT-SVD, RivaGAN) can be evaded with negligible perceptual degradation, achieving a detection rate as low as $0.063$ while requiring less than $1$ GPU hour. The work highlights the need for rigorous robustness testing and potential certifications to ensure watermarking schemes withstand adaptive, learnable threats in practical deployment.

Abstract

Untrustworthy users can misuse image generators to synthesize high-quality deepfakes and engage in unethical activities. Watermarking deters misuse by marking generated content with a hidden message, enabling its detection using a secret watermarking key. A core security property of watermarking is robustness, which states that an attacker can only evade detection by substantially degrading image quality. Assessing robustness requires designing an adaptive attack for the specific watermarking algorithm. When evaluating watermarking algorithms and their (adaptive) attacks, it is challenging to determine whether an adaptive attack is optimal, i.e., the best possible attack. We solve this problem by defining an objective function and then approaching adaptive attacks as an optimization problem. The core idea of our adaptive attacks is to replicate secret watermarking keys locally by creating surrogate keys that are differentiable and can be used to optimize the attack's parameters. We demonstrate for Stable Diffusion models that such an attacker can break all five surveyed watermarking methods with no visible degradation in image quality. Optimizing our attacks is efficient and requires less than 1 GPU hour to reduce the detection accuracy to 6.3% or less. Our findings emphasize the need for more rigorous robustness testing against adaptive, learnable attackers.
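
To make the idea of optimizing a perturbation against a differentiable surrogate more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a hypothetical linear surrogate detector (a plain correlation between the image and a surrogate key) and runs projected gradient descent inside an $L_\infty$ budget $\epsilon$. The function names, the linear detector, and the values of `eps`, `step`, and `iters` are all illustrative assumptions; the paper's surrogate keys and detectors are neural and considerably more involved.

```python
def surrogate_score(image, key):
    """Toy differentiable detector (assumption): mean correlation of image and key."""
    return sum(p * k for p, k in zip(image, key)) / len(image)


def surrogate_grad(image, key):
    """Gradient of surrogate_score w.r.t. each pixel; constant for a linear detector."""
    n = len(image)
    return [k / n for k in key]


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def clamp(v, lo, hi):
    """Restrict v to the closed interval [lo, hi]."""
    return min(hi, max(lo, v))


def adversarial_noising(image, key, eps=2 / 255, step=0.5 / 255, iters=10):
    """Sketch of an L_inf-bounded attack that lowers the surrogate detector score.

    image: flat list of pixel values in [0, 1]
    key:   flat list of surrogate key weights (hypothetical stand-in for a
           differentiable surrogate key)
    """
    delta = [0.0] * len(image)
    for _ in range(iters):
        attacked = [clamp(p + d, 0.0, 1.0) for p, d in zip(image, delta)]
        grad = surrogate_grad(attacked, key)
        # Signed gradient step that decreases the score, then projection
        # back onto the L_inf ball of radius eps.
        delta = [clamp(d - step * sign(g), -eps, eps) for d, g in zip(delta, grad)]
    return [clamp(p + d, 0.0, 1.0) for p, d in zip(image, delta)]
```

In this toy setting the perturbation simply saturates at $\pm\epsilon$ in the direction opposite to the key, so the surrogate score drops while every pixel changes by at most $\epsilon$. A real attack would replace the linear detector with a learned surrogate and evaluate against the actual (secret-key) verifier.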

Paper Structure (24 sections, 7 equations, 6 figures, 3 tables, 3 algorithms)

Figures (6)

  • Figure 1: An overview of our adaptive attack pipeline. The attacker prepares their attack by generating a surrogate key and leveraging optimization to find optimal attack parameters $\theta_\mathcal{A}$ (illustrated here as an encoder $\mathcal{E}$ and decoder $\mathcal{D}$) for any message. Then, the attacker generates watermarked images and applies a modification using their optimized attack to evade detection. The attack is successful if the verification procedure cannot detect the watermark in high-quality images.
  • Figure 2: The effectiveness of our attacks against all watermarks. We highlight the Pareto front for each watermarking method by dashed lines and indicate adaptive/non-adaptive attacks by colors.
  • Figure 3: A visual analysis of two adaptive attacks. The left image shows the unwatermarked output, including a high-contrast cutout of the top left corner of the image to visualize noise artifacts. On the right are images after evasion with adversarial noising (top) and adversarial compression (bottom).
  • Figure 4: Ablation studies over (left) the maximum perturbation budget $\epsilon$ in $L_\infty$ for adversarial noising and (right) the number of adversarial compressions against each watermarking method. "No Optimizations" means we did not optimize the parameters $\theta_\mathcal{A}$ of the attack.
  • Figure 5: Qualitative showcase of three kinds of images: non-watermarked, watermarked with the indicated technique, and attacked with the strongest attack from the best-attack summary table (label tab:best_attack_summary). The p-values and text prompts are also provided.
  • ...and 1 more figure