RoboSignature: Robust Signature and Watermarking on Network Attacks
Aryaman Shaan, Garvit Banga, Raghav Mantri
TL;DR
This work investigates the robustness of watermarking in Latent Diffusion Models (LDMs) and identifies a vulnerability to network-level adversarial fine-tuning. It introduces the Random Key Attack and the Gradual Random Key Attack, which exploit the space of $2^{48}$ possible 48-bit watermark keys, and demonstrates how watermarks can be corrupted or misembedded. To counter these threats, the authors adapt Tamper-Resistant Fine-Tuning (TAR) from LLMs to LDMs, yielding a defense that preserves image quality while partially maintaining watermark integrity. Experimental results show that the modified TAR achieves high PSNR and high watermark bit accuracy on evaluation data, but aggressive attacks still reduce bit-traceability, underscoring the need for stronger, principled defenses for watermarking in diffusion models. The findings have practical implications for securely open-sourcing watermarked generative systems and guide future research on tamper-resistant defenses for diffusion-based content generation.
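To make the key-space and bit-accuracy terminology concrete, the sketch below follows the detection scheme described for Stable Signature: the watermark is a 48-bit message (so there are $2^{48}$ distinct keys, not a $2^{48}$-bit key), an extractor recovers 48 bits from an image, and an image is flagged as watermarked when at least tau bits match the owner's key. If the extracted bits of an unwatermarked image behave like independent fair coins, the false positive rate follows a binomial tail. This is a minimal illustrative sketch, not the paper's evaluation code; the function names and any choice of tau are assumptions.

```python
import math
import random

KEY_BITS = 48  # message length used by Stable Signature


def key_space_size(k=KEY_BITS):
    """Number of distinct k-bit keys."""
    return 2 ** k


def bit_accuracy(extracted, key):
    """Fraction of positions where the extracted bits match the key."""
    if len(extracted) != len(key):
        raise ValueError("extracted bits and key must have the same length")
    return sum(a == b for a, b in zip(extracted, key)) / len(key)


def false_positive_rate(tau, k=KEY_BITS):
    """P[at least tau of k bits match] when extracted bits are i.i.d. fair coins."""
    return sum(math.comb(k, i) for i in range(tau, k + 1)) / 2 ** k


def detect(extracted, key, tau):
    """Flag as watermarked if at least tau bits agree with the key (hypothetical rule)."""
    return sum(a == b for a, b in zip(extracted, key)) >= tau


if __name__ == "__main__":
    rng = random.Random(1)
    owner = [rng.getrandbits(1) for _ in range(KEY_BITS)]
    other = [rng.getrandbits(1) for _ in range(KEY_BITS)]
    print("key space:", key_space_size())
    # Two unrelated random keys agree on roughly half their bits.
    print("accuracy of unrelated key:", bit_accuracy(other, owner))
    print("FPR at tau=40:", false_positive_rate(40))
```

A Random Key Attack that makes the decoder embed an unrelated key drives bit accuracy against the owner's key toward roughly 0.5, which is what a detector of this form would see for an unwatermarked image.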
Abstract
Generative models have made it easy to create images of all kinds from a single prompt. However, this has also raised ethical concerns about distinguishing authentic content, created by humans or captured by cameras, from model-generated content such as images or videos. Watermarking the outputs of modern generative models is a popular way to provide information about the source of the content: the goal is for every generated image to conceal an invisible watermark that allows later detection or identification. Stable Signature fine-tunes the decoder of a Latent Diffusion Model (LDM) so that a unique watermark is embedded in every image the decoder produces. In this paper, we present a novel adversarial fine-tuning attack that disrupts the model's ability to embed the intended watermark, exposing a significant vulnerability in existing watermarking methods. To address this, we propose a tamper-resistant fine-tuning algorithm inspired by methods developed for large language models and tailored to the specific requirements of watermarking in LDMs. Our findings emphasize the importance of anticipating and defending against potential vulnerabilities in generative systems.
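The idea of an adversarial fine-tuning attack can be illustrated with a toy sketch. Stable Signature fine-tunes the decoder with a binary cross-entropy message loss between the extractor's per-bit logits and the owner's key; an adversary with access to the weights can minimize the same loss toward a freshly sampled random key instead. In the sketch, the decoder and extractor are collapsed into a free vector of logits (a hypothetical stand-in, not the paper's setup), no image-quality term is included, and the names, learning rate, and step count are illustrative assumptions.

```python
import math
import random

KEY_BITS = 48  # Stable Signature's message length


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bce_with_logits(z, m):
    """Stable binary cross-entropy for one logit z and target bit m."""
    return max(z, 0.0) - z * m + math.log1p(math.exp(-abs(z)))


def message_loss(logits, key):
    """Sum of per-bit BCE between extractor logits and target key bits."""
    return sum(bce_with_logits(z, m) for z, m in zip(logits, key))


def extract(logits):
    """Hard-decode each bit: 1 when its logit is positive."""
    return [1 if z > 0 else 0 for z in logits]


def random_key(rng, k=KEY_BITS):
    return [rng.getrandbits(1) for _ in range(k)]


def random_key_attack(logits, target_key, lr=1.0, steps=200):
    """Toy attack: gradient descent on message_loss toward target_key.

    The logits stand in for the decoder's parameters (hypothetical);
    d/dz BCE(z, m) = sigmoid(z) - m.
    """
    z = list(logits)
    for _ in range(steps):
        z = [zi - lr * (sigmoid(zi) - m) for zi, m in zip(z, target_key)]
    return z


if __name__ == "__main__":
    rng = random.Random(0)
    owner = random_key(rng)
    attacker = random_key(rng)
    # Before the attack the "model" confidently embeds the owner's key.
    logits = [4.0 if b else -4.0 for b in owner]
    after = random_key_attack(logits, attacker)
    print(extract(after) == attacker)
```

In a real LDM the attacker updates decoder weights rather than logits, and the paper's Gradual Random Key Attack and TAR defense are not modeled here; the sketch only shows why an unchanged training objective with a substituted target key can overwrite the embedded watermark.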
