Stable Signature is Unstable: Removing Image Watermark from Diffusion Models
Yuepeng Hu, Zhengyuan Jiang, Moyang Guo, Neil Gong
TL;DR
The paper examines the robustness of Stable Signature, a watermarking scheme for open-source diffusion models that embeds a ground-truth watermark into the model's decoder. It introduces a model-targeted attack that fine-tunes the watermarked decoder $D_w$ using non-watermarked images, in two steps: (1) estimating a denoised latent vector $\tilde{z}^i$ for each non-watermarked image, and (2) retraining $D_w$ so that $D_w(\tilde{z}^i)$ matches the corresponding non-watermarked image. Empirically, the attack achieves high evasion rates ($>94\%$) and low bitwise accuracy of the decoded watermark ($<66\%$) while keeping the Fréchet Inception Distance competitive ($<27.5$), outperforming the prior MP approach both with and without access to the encoder. The results imply that in-generation watermarking for open-source diffusion models remains vulnerable and motivate more robust watermarking techniques. The work highlights practical risks in watermark-based detection of AI-generated images and informs future defenses against watermark-removal attacks.
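To make the reported metrics concrete, here is a minimal Python sketch of how bitwise accuracy and evasion rate could be computed. It assumes a detector that flags an image as watermarked when the bitwise accuracy of its decoded watermark reaches a threshold `tau`. The threshold value, the 8-bit watermark length, and all function names are hypothetical choices for illustration, not the paper's actual settings.

```python
from typing import Sequence


def bitwise_accuracy(decoded: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of bit positions where the decoded watermark equals the reference."""
    if len(decoded) != len(reference):
        raise ValueError("watermarks must have the same length")
    matches = sum(1 for d, r in zip(decoded, reference) if d == r)
    return matches / len(reference)


def is_detected(decoded: Sequence[int], reference: Sequence[int], tau: float) -> bool:
    """Assumed detection rule: flag the image when bitwise accuracy is at least tau."""
    return bitwise_accuracy(decoded, reference) >= tau


def evasion_rate(decoded_batch: Sequence[Sequence[int]],
                 reference: Sequence[int], tau: float) -> float:
    """Fraction of images whose decoded watermark is NOT flagged by the detector."""
    evaded = sum(1 for d in decoded_batch if not is_detected(d, reference, tau))
    return evaded / len(decoded_batch)


# Hypothetical 8-bit ground-truth watermark and four decoded watermarks.
reference = [1, 0, 1, 1, 0, 0, 1, 0]
decoded_batch = [
    [1, 0, 1, 1, 0, 0, 1, 0],  # 8 of 8 bits match
    [1, 0, 0, 1, 1, 0, 1, 1],  # 5 of 8 bits match
    [0, 1, 1, 0, 0, 1, 1, 0],  # 4 of 8 bits match
    [1, 1, 1, 1, 0, 0, 0, 0],  # 6 of 8 bits match
]
tau = 0.75  # hypothetical detection threshold

accuracies = [bitwise_accuracy(d, reference) for d in decoded_batch]
rate = evasion_rate(decoded_batch, reference, tau)
```

Under this rule, a successful removal attack pushes bitwise accuracy toward the chance level of 50% for random bits, so that most generated images fall below `tau` and the evasion rate approaches 1.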
Abstract
Watermarking has been widely deployed by industry to detect AI-generated images. A recent watermarking framework called \emph{Stable Signature} (proposed by Meta) roots a watermark into the parameters of a diffusion model's decoder such that its generated images are inherently watermarked. Stable Signature makes it possible to watermark images generated by \emph{open-source} diffusion models and was claimed to be robust against removal attacks. In this work, we propose a new attack that removes the watermark from a diffusion model by fine-tuning it. Our results show that the attack effectively removes the watermark, so that the model's generated images are non-watermarked, while maintaining their visual quality. Our results highlight that Stable Signature is not as stable as previously thought.
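The two-step structure of the fine-tuning attack (estimate a latent for each non-watermarked image, then retrain the decoder to reproduce that image from the latent) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the decoder is a tiny linear map rather than a diffusion model's decoder, the loss is plain squared error, the latents are estimated by gradient descent through the frozen decoder (one plausible approach when no encoder is available), and all names and numbers are hypothetical. It shows only the optimisation structure, not watermark removal itself.

```python
def matvec(W, z):
    """Toy decoder: D(z) = W z, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, z)) for row in W]


def sq_error(W, z, x):
    """Squared reconstruction error ||W z - x||^2."""
    return sum((p - t) ** 2 for p, t in zip(matvec(W, z), x))


def estimate_latent(W, x, steps=500, lr=0.05):
    """Step 1: with the decoder frozen, descend on z to minimise ||W z - x||^2."""
    z = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [p - t for p, t in zip(matvec(W, z), x)]
        # Gradient with respect to z is 2 * W^T * residual.
        grad = [2 * sum(W[i][j] * residual[i] for i in range(len(W)))
                for j in range(len(z))]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z


def finetune_decoder(W, latents, images, steps=500, lr=0.05):
    """Step 2: with latents frozen, descend on W to minimise sum_i ||W z_i - x_i||^2."""
    W = [row[:] for row in W]  # work on a copy of the original weights
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for z, x in zip(latents, images):
            residual = [p - t for p, t in zip(matvec(W, z), x)]
            for i in range(len(W)):
                for j in range(len(z)):
                    grad[i][j] += 2 * residual[i] * z[j]
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
    return W


# Hypothetical "watermarked" decoder weights (3-pixel images, 2-dim latents).
W_w = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Hypothetical non-watermarked images available to the attacker.
images = [[1.0, 0.0, 0.8], [0.0, 1.0, 0.2], [1.0, 1.0, 1.3]]

latents = [estimate_latent(W_w, x) for x in images]      # step 1
W_ft = finetune_decoder(W_w, latents, images)            # step 2

loss_before = sum(sq_error(W_w, z, x) for z, x in zip(latents, images))
loss_after = sum(sq_error(W_ft, z, x) for z, x in zip(latents, images))
```

In this toy, `loss_after` is smaller than `loss_before`, meaning the fine-tuned decoder reproduces the non-watermarked images more closely from the same latents. When the encoder is available, step 1 would presumably reduce to a forward pass through it instead of an optimisation over `z`.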
