SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image Diffusion Models

Zilan Wang, Junfeng Guo, Jiacheng Zhu, Yiming Li, Heng Huang, Muhao Chen, Zhengzhong Tu

TL;DR

SleeperMark tackles the problem of protecting IP in pre-trained text-to-image diffusion models under black-box verification by introducing a backdoor-style watermark that is disentangled from semantic knowledge. It uses latent-space watermark pre-training to produce a secret latent residual, then injects a message-embedding backdoor during diffusion fine-tuning via triggered prompts, ensuring watermark extraction remains reliable after downstream tasks. The method extends to pixel diffusion models with a distortion layer and adversarial loss to maintain stealth while preserving fidelity. Extensive experiments show SleeperMark outperforms baselines in fidelity, robustness to fine-tuning (e.g., LoRA, DreamBooth, ControlNet), and stealth, offering a practical tool for black-box ownership verification in real-world scenarios where models are adapted post-release.

Abstract

Recent advances in large-scale text-to-image (T2I) diffusion models have enabled a variety of downstream applications, including style customization, subject-driven personalization, and conditional generation. As T2I models require extensive data and computational resources for training, they constitute highly valued intellectual property (IP) for their legitimate owners, which also makes them attractive targets for unauthorized fine-tuning by adversaries seeking to leverage these models for customized, usually profitable applications. Existing IP protection methods for diffusion models generally involve embedding watermark patterns and then verifying ownership by examining generated outputs or inspecting the model's feature space. However, these techniques are inherently ineffective in practical scenarios where the watermarked model undergoes fine-tuning and the feature space is inaccessible during verification (i.e., the black-box setting). The model is prone to forgetting the previously learned watermark knowledge when it adapts to a new task. To address this challenge, we propose SleeperMark, a novel framework designed to embed resilient watermarks into T2I diffusion models. SleeperMark explicitly guides the model to disentangle the watermark information from the semantic concepts it learns, allowing the model to retain the embedded watermark while continuing to be adapted to new downstream tasks. Our extensive experiments demonstrate the effectiveness of SleeperMark across various types of diffusion models, including latent diffusion models (e.g., Stable Diffusion) and pixel diffusion models (e.g., DeepFloyd-IF), showing robustness against downstream fine-tuning and various attacks at both the image and model levels, with minimal impact on the model's generative capability. The code is available at https://github.com/taco-group/SleeperMark.

Paper Structure

This paper contains 66 sections, 9 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: The threat model considered in our work.
  • Figure 2: Illustration of our motivation. We applied WatermarkDM, AquaLoRA, and our proposed SleeperMark to watermark Stable Diffusion v1.4, followed by fine-tuning on the Naruto dataset using LoRA (rank = 10) for style adaptation. (a) WatermarkDM embeds a watermark image triggered by the specific prompt "[V]," which becomes unrecognizable after approximately 800 fine-tuning steps. (b) AquaLoRA embeds a binary message into generated outputs, but the message can no longer be extracted after fewer than 100 steps of fine-tuning. (c) Our framework allows the message to be consistently extracted from outputs generated by triggered prompts, with bit accuracy remaining nearly 1.0 even after 1600 steps of fine-tuning.
  • Figure 3: Pipeline overview for T2I latent diffusion models. (a) In the latent watermark pre-training stage, we jointly train a watermark secret encoder $E_\varphi$ and a secret decoder $D_\gamma$ at the latent level to derive a secret residual $\delta_z^*$. (b) In the diffusion backbone fine-tuning stage, we leverage the derived $\delta_z^*$ and employ our proposed strategy to inject a message-embedding backdoor into the model, which can be activated by placing a trigger at the start of any prompt. Adversaries may obtain an unauthorized copy of the watermarked model and further fine-tune it for their own tasks. (c) To verify ownership of a suspect model, we extract messages from images generated with triggered prompts, followed by a statistical test to determine whether the model is derived from the original watermarked one.
  • Figure 4: Qualitative comparison. The red boxes highlight the artifacts introduced by AquaLoRA. The rightmost two columns show images generated with triggered prompts, where the trigger "*[Z]&" is added at the start of regular prompts to activate certain behavior of the watermarked model.
  • Figure 5: Generation results of watermarked SD v1.4 with our method after fine-tuning across diverse downstream tasks: (a) style adaptation, (b) personalization, (c) additional condition integration. The watermark embedded in the pre-trained SD v1.4 using our method does not impair the model's adaptability to these tasks.
  • ...and 16 more figures
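
The black-box verification step described in the Figure 3(c) caption (extract a message from images generated with triggered prompts, then run a statistical test) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the space between trigger and prompt, the significance level `alpha`, and the choice of a one-sided binomial test against chance-level bit matching are all assumptions made for illustration. Only the trigger string "*[Z]&" comes from the Figure 4 caption. Extracting bits from an image requires the trained secret decoder, so the sketch takes already-extracted bits as input.

```python
from math import comb

# Trigger string quoted in the Figure 4 caption.
TRIGGER = "*[Z]&"


def add_trigger(prompt: str, trigger: str = TRIGGER) -> str:
    """Place the trigger at the start of a regular prompt.

    Joining with a single space is an assumption of this sketch.
    """
    return f"{trigger} {prompt}"


def count_matches(extracted: list[int], reference: list[int]) -> int:
    """Count positions where the extracted bit equals the owner's bit."""
    if len(extracted) != len(reference):
        raise ValueError("bit strings must have the same length")
    return sum(int(a == b) for a, b in zip(extracted, reference))


def bit_accuracy(extracted: list[int], reference: list[int]) -> float:
    """Fraction of matching bits (1.0 means the message is fully recovered)."""
    return count_matches(extracted, reference) / len(reference)


def binomial_p_value(matches: int, n_bits: int) -> float:
    """P(X >= matches) for X ~ Binomial(n_bits, 0.5).

    Null hypothesis (assumed here): the suspect model is unrelated, so each
    extracted bit matches the owner's bit independently with probability 0.5.
    """
    tail = sum(comb(n_bits, k) for k in range(matches, n_bits + 1))
    return tail / 2 ** n_bits


def verify_ownership(extracted: list[int], reference: list[int],
                     alpha: float = 1e-6) -> tuple[bool, float]:
    """Reject the null (i.e., claim ownership) when the p-value is below alpha.

    alpha is a hypothetical threshold chosen for this example.
    """
    p = binomial_p_value(count_matches(extracted, reference), len(reference))
    return p < alpha, p
```

With a longer message, even a few bit errors leave the p-value far below typical thresholds, which is why a bit accuracy that stays near 1.0 after fine-tuning (as in the Figure 2(c) caption) would support an ownership claim under this kind of test.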