Evaluation of Security of ML-based Watermarking: Copy and Removal Attacks
Vitaliy Kinakh, Brian Pulfer, Yury Belousov, Pierre Fernandez, Teddy Furon, Slava Voloshynovskiy
TL;DR
The paper investigates the security of watermarking schemes that embed information in the latent spaces of foundation models via adversarial embedding. It examines two attack classes, copy attacks and removal attacks, and evaluates them on DINOv1-based zero-bit and multi-bit watermarking using the DIV2K dataset. Copy attacks achieve high success rates, especially against zero-bit schemes, while removal attacks are more effective overall; targeted removal uses a specific target image or latent state to erase the watermark or degrade its recoverability. The study highlights significant vulnerabilities in latent-space watermarking built on current foundation models and calls for evaluating a broader range of models and benchmarking against classical watermarking approaches. These findings have practical implications for designing more secure watermarking of AI-generated and manipulated content.
Abstract
The vast amounts of digital content captured from the real world or AI-generated media necessitate methods for copyright protection, traceability, or data provenance verification. Digital watermarking serves as a crucial approach to address these challenges. Its evolution spans three generations: handcrafted, autoencoder-based, and foundation-model-based methods. While the robustness of these systems is well documented, their security against adversarial attacks remains underexplored. This paper evaluates the security of foundation models' latent space digital watermarking systems that utilize adversarial embedding techniques. A series of experiments investigates the security dimensions under copy and removal attacks, providing empirical insights into these systems' vulnerabilities. All experimental code and results are available at https://github.com/vkinakh/ssl-watermarking-attacks.
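To make the copy and removal attacks named above concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a fixed linear map `W` as a hypothetical stand-in for the frozen foundation-model backbone, a secret unit vector `key`, and a zero-bit detector that thresholds the cosine between the latent and the key; all names, sizes, and thresholds are illustrative assumptions, and no distortion constraint is modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
N_PIX, N_FEAT = 256, 32  # toy sizes (assumption)

# Hypothetical stand-in for a frozen backbone: a linear map with
# orthonormal rows, so W @ W.T is the identity.
q, _ = np.linalg.qr(rng.standard_normal((N_PIX, N_FEAT)))
W = q.T

# Secret key known to the watermark owner only (unit vector in latent space).
key = rng.standard_normal(N_FEAT)
key /= np.linalg.norm(key)
DETECT_COS = 0.90  # illustrative detection threshold (assumption)


def extract(x):
    """Latent representation of a flattened image x."""
    return W @ x


def cosine_to_key(x):
    f = extract(x)
    return abs(f @ key) / np.linalg.norm(f)


def detect(x):
    """Zero-bit detection: is the latent inside a two-sided cone around the key?"""
    return cosine_to_key(x) > DETECT_COS


def embed(x, target_cos=0.95, lr=0.1, max_steps=10_000):
    """Adversarial embedding: gradient ascent on |f . key| in pixel space."""
    x = x.copy()
    for _ in range(max_steps):
        if cosine_to_key(x) >= target_cos:
            break
        sign = 1.0 if extract(x) @ key >= 0 else -1.0
        x += lr * sign * (W.T @ key)  # gradient of |W x . key| w.r.t. x
    return x


def match_latent(x_start, x_ref, lr=0.25, steps=60):
    """Key-free attack: move x_start until its latent matches that of x_ref."""
    x = x_start.copy()
    f_ref = extract(x_ref)
    for _ in range(steps):
        residual = extract(x) - f_ref
        x -= lr * 2.0 * (W.T @ residual)  # gradient of ||W x - f_ref||^2
    return x


x_host = rng.standard_normal(N_PIX)   # image to be watermarked
x_other = rng.standard_normal(N_PIX)  # unrelated clean image

x_wm = embed(x_host)
x_copy = match_latent(x_other, x_wm)     # copy attack: clean image gains the mark
x_removed = match_latent(x_wm, x_other)  # targeted removal: marked image loses it

for name, img in [("host", x_host), ("watermarked", x_wm),
                  ("copy", x_copy), ("removed", x_removed)]:
    print(name, bool(detect(img)))
```

In this sketch both attacks reuse the same latent-matching routine and never touch `key`; only the start and reference images are swapped. A realistic attack would also need to bound the visible distortion and would face a nonlinear backbone rather than a linear map.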
