Stable Signature is Unstable: Removing Image Watermark from Diffusion Models

Yuepeng Hu; Zhengyuan Jiang; Moyang Guo; Neil Gong

TL;DR

The paper examines the robustness of Stable Signature, a watermarking scheme that embeds a ground-truth watermark into the decoder of a diffusion model so that images generated by open-source diffusion models are inherently watermarked. It introduces a model-targeted attack that fine-tunes the watermarked decoder $D_w$ using non-watermarked images in two steps: (I) estimating a denoised latent vector $\tilde{z}^i$ for each non-watermarked image, and (II) fine-tuning $D_w$ so that $D_w(\tilde{z}^i)$ matches the corresponding non-watermarked image. Empirically, the attack achieves high evasion rates ($>94\%$) and low bitwise accuracy of the decoded watermark ($<66\%$) while keeping a competitive Fréchet Inception Distance ($<27.5$), and it outperforms the prior MP approach both with and without access to the encoder. The results indicate that in-generation watermarking for open-source diffusion models remains vulnerable to removal, which exposes a practical risk for watermark-based detection of AI-generated images and motivates more robust watermarking techniques.
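The two effectiveness metrics quoted above can be made concrete with a small sketch. It assumes that bitwise accuracy is the fraction of decoded bits that match the ground-truth watermark, that an image counts as detected when this fraction reaches a threshold, and that the evasion rate is the fraction of images that are not detected. The function names, the 8-bit key, and the threshold `tau = 0.9` are hypothetical values chosen for illustration, not settings taken from the paper.

```python
def bitwise_accuracy(decoded_bits, key_bits):
    """Fraction of decoded bits that match the ground-truth watermark."""
    if len(decoded_bits) != len(key_bits):
        raise ValueError("bit strings must have the same length")
    matches = sum(d == k for d, k in zip(decoded_bits, key_bits))
    return matches / len(key_bits)


def is_detected(decoded_bits, key_bits, tau):
    """Assumed detection rule: flag the image if bitwise accuracy >= tau."""
    return bitwise_accuracy(decoded_bits, key_bits) >= tau


def evasion_rate(decoded_list, key_bits, tau):
    """Fraction of images whose decoded watermark falls below the threshold."""
    evaded = sum(not is_detected(bits, key_bits, tau) for bits in decoded_list)
    return evaded / len(decoded_list)


# Hypothetical 8-bit watermark and three decoded bit strings.
key = [1, 0, 1, 1, 0, 0, 1, 0]
decoded = [
    [1, 0, 1, 1, 0, 0, 1, 0],  # all bits match
    [0, 1, 0, 1, 0, 0, 1, 0],  # first three bits flipped
    [0, 1, 0, 0, 0, 0, 1, 0],  # first four bits flipped
]
tau = 0.9  # hypothetical detection threshold

accuracies = [bitwise_accuracy(bits, key) for bits in decoded]
rate = evasion_rate(decoded, key, tau)
```

Under these assumptions, a successful removal attack pushes bitwise accuracy toward the 50% expected from random guessing, so that most generated images fall below the detection threshold.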

Abstract

Watermarking has been widely deployed by industry to detect AI-generated images. A recent watermarking framework called \emph{Stable Signature} (proposed by Meta) roots the watermark into the parameters of a diffusion model's decoder such that its generated images are inherently watermarked. Stable Signature makes it possible to watermark images generated by \emph{open-source} diffusion models and was claimed to be robust against removal attacks. In this work, we propose a new attack to remove the watermark from a diffusion model by fine-tuning it. Our results show that our attack can effectively remove the watermark from a diffusion model such that its generated images are non-watermarked, while maintaining the visual quality of the generated images. Our results highlight that Stable Signature is not as stable as previously thought.
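The idea of removing a watermark by fine-tuning the decoder can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the decoder is a tiny linear map rather than a neural network, the "watermark" is a bias in the third pixel, the detector is a simple threshold, and the latent is estimated by direct gradient descent on $z$, which is one plausible encoder-free approach and not necessarily the paper's procedure. The loss is plain squared error; the paper's actual loss is not given here.

```python
import random


def decode(W, b, z):
    """Toy linear stand-in for a decoder: image = W z + b."""
    return [sum(w * zj for w, zj in zip(row, z)) + bi for row, bi in zip(W, b)]


def estimate_latent(W, b, x, steps=200, lr=0.1):
    """Step I (assumed variant): gradient descent on z to minimize ||D_w(z) - x||^2."""
    z = [0.0] * len(W[0])
    for _ in range(steps):
        r = [o - xi for o, xi in zip(decode(W, b, z), x)]  # residual D_w(z) - x
        grad = [2 * sum(W[i][j] * r[i] for i in range(len(W))) for j in range(len(z))]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z


def fine_tune(W, b, latents, images, steps=300, lr=0.1):
    """Step II: update decoder parameters so D_w(z~) matches the non-watermarked images."""
    W, b, n = [row[:] for row in W], b[:], len(images)
    for _ in range(steps):
        gW = [[0.0] * len(W[0]) for _ in W]
        gb = [0.0] * len(b)
        for z, x in zip(latents, images):
            r = [o - xi for o, xi in zip(decode(W, b, z), x)]
            for i, ri in enumerate(r):
                gb[i] += 2 * ri / n  # gradient of the mean squared error w.r.t. bias
                for j, zj in enumerate(z):
                    gW[i][j] += 2 * ri * zj / n  # gradient w.r.t. weights
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, gW)]
        b = [bi - lr * g for bi, g in zip(b, gb)]
    return W, b


def detected(image, threshold=0.25):
    """Hypothetical detector: the toy watermark lives in the third pixel."""
    return image[2] > threshold


random.seed(0)
W_w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # watermarked decoder weights
b_w = [0.0, 0.0, 0.5]                        # bias acting as the embedded watermark
# Non-watermarked images available to the attacker; their true latents are unknown to it.
images = [[random.gauss(0, 1), random.gauss(0, 1), 0.0] for _ in range(50)]

latents = [estimate_latent(W_w, b_w, x) for x in images]  # Step I
W_ft, b_ft = fine_tune(W_w, b_w, latents, images)         # Step II

z_new = [0.3, -0.7]  # a fresh latent, as if produced by the diffusion process
before = decode(W_w, b_w, z_new)
after = decode(W_ft, b_ft, z_new)
print(detected(before), detected(after))
```

In this toy setting the fine-tuned decoder keeps the image content (the first two pixels) while the watermark component shrinks toward zero, which mirrors the stated goal of removing the watermark while maintaining visual quality. A real attack would operate on a neural decoder with a gradient-based optimizer and an image-quality-aware loss.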

Paper Structure (18 sections, 7 equations, 9 figures, 1 table, 2 algorithms)

Introduction
Related Works
Latent Diffusion Model
Image Watermark
Watermark Removal Attacks
Problem Formulation
Watermarked Diffusion Model Decoder $D_w$
Threat Model
Our Attack
Overview
Step I: Estimate the Denoised Latent Vector $z$
Step II: Fine-tune the Decoder $D_w$
Evaluation
Experimental Setup
Experimental Results
...and 3 more sections

Figures (9)

Figure 1: An example image generated by (a) the clean Stable Diffusion 2.1, (b) Stable Diffusion 2.1 watermarked by Stable Signature, (c) watermarked Stable Diffusion 2.1 fine-tuned by MP, (d) watermarked Stable Diffusion 2.1 fine-tuned by our attack with access to the encoder, and (e) watermarked Stable Diffusion 2.1 fine-tuned by our attack without access to the encoder. The same denoised latent vector is used by all diffusion models' decoders to generate the images. The watermark can only be detected in the image generated by (b). The image generated by (c) shows a significant loss of detail.
Figure 2: The main components of a latent diffusion model.
Figure 3: Overview of our attack. The solid arrows represent the direction of data flow and the dashed arrows represent the direction of gradient flow.
Figure 4: Effectiveness and utility of MP and our attack with the three attacking datasets.
Figure 5: Image reconstruction performance for different variants to estimate $z$ on ImageNet. NW denotes the non-watermarked image.
...and 4 more figures
