The Stable Signature: Rooting Watermarks in Latent Diffusion Models
Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, Teddy Furon
TL;DR
This work addresses the responsible deployment of image-generation systems by introducing Stable Signature, an active watermarking approach that embeds the watermark during generation in Latent Diffusion Models. Only the latent decoder is fine-tuned, using a pre-trained watermark extractor, so that every generated image carries an invisible, retrievable binary signature. This enables both detection of AI-generated content and identification of its source. The authors formulate a statistical framework for detection and identification, train a watermark extractor and apply whitening so that the extracted bits behave approximately as i.i.d., and report robust performance across a range of generation tasks and image transformations with minimal impact on perceptual quality. They also analyze adversarial and collusion attacks, showing that while some attacks can degrade the watermark, it remains broadly resilient under practical threat models. Overall, Stable Signature offers a scalable, secure mechanism for detecting generated content and tracing it to its source. Reproducibility and environmental impact are also discussed.
Abstract
Generative image modeling enables a wide range of applications but raises ethical concerns about responsible deployment. This paper introduces an active strategy combining image watermarking and Latent Diffusion Models. The goal is for all generated images to conceal an invisible watermark allowing for future detection and/or identification. The method quickly fine-tunes the latent decoder of the image generator, conditioned on a binary signature. A pre-trained watermark extractor recovers the hidden signature from any generated image and a statistical test then determines whether it comes from the generative model. We evaluate the invisibility and robustness of the watermarks on a variety of generation tasks, showing that Stable Signature works even after the images are modified. For instance, it detects the origin of an image generated from a text prompt, then cropped to keep $10\%$ of the content, with over $90\%$ accuracy at a false positive rate below $10^{-6}$.
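The statistical test mentioned above can be illustrated with a minimal sketch. It assumes that, for an image not produced by the watermarked model, each extracted bit matches the signature independently with probability 1/2, so the number of matching bits follows a binomial distribution. The function names, the 48-bit signature length, and the thresholding rule (flag when the match count reaches the threshold) are illustrative assumptions, not details stated in this section.

```python
from math import comb

def match_count(extracted, signature):
    """Number of positions where the extracted bits equal the signature bits."""
    if len(extracted) != len(signature):
        raise ValueError("bit strings must have the same length")
    return sum(a == b for a, b in zip(extracted, signature))

def false_positive_rate(k, tau):
    """P(M >= tau) for M ~ Binomial(k, 1/2).

    Assumption: under the null hypothesis (image not from the model),
    the k extracted bits are i.i.d. fair coin flips.
    """
    return sum(comb(k, i) for i in range(tau, k + 1)) / 2**k

def min_threshold(k, target_fpr):
    """Smallest tau whose false positive rate is at most target_fpr.

    Returns None when even a perfect match (tau = k) is not rare enough.
    """
    for tau in range(k + 1):
        if false_positive_rate(k, tau) <= target_fpr:
            return tau
    return None

def detect(extracted, signature, target_fpr=1e-6):
    """Flag the image as generated by the model if enough bits match."""
    tau = min_threshold(len(signature), target_fpr)
    if tau is None:
        return False
    return match_count(extracted, signature) >= tau

# Hypothetical 48-bit signature, used only for illustration.
signature = [i % 2 for i in range(48)]
# Simulate an image whose extracted bits differ in 5 positions.
noisy = [1 - b if i < 5 else b for i, b in enumerate(signature)]
print(min_threshold(48, 1e-6), detect(noisy, signature))
```

Under these assumptions, a 48-bit signature and a target false positive rate of $10^{-6}$ give a threshold of 41 matching bits, so the simulated image with 43 matching bits is flagged. Identification among several candidate signatures would repeat the same test per candidate, which requires a correspondingly stricter per-test threshold.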
