
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing

Runyi Hu, Jie Zhang, Ting Xu, Jiwei Li, Tianwei Zhang

TL;DR

Instruction-driven image editing enables rapid semantic changes that can undermine image provenance and make fake content easy to produce. Robust-Wide couples an encoder–noise–decoder watermarking framework with Partial Instruction-driven Denoising Sampling Guidance (PIDSG) so that watermarks are embedded in semantic regions and can be recovered after editing. It achieves a low bit error rate ($\mathrm{BER} \approx 2.66\%$) for a 64-bit message while maintaining high visual fidelity ($\mathrm{PSNR} \approx 40$–$42$ dB, $\mathrm{SSIM} \approx 0.99$) and editability, and it generalizes across multiple editing models and distortions. The method targets copyright protection for images processed by instruction-driven editing tools, and its code and models are publicly available.
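To give a feel for the fidelity figure quoted above, the following minimal sketch computes PSNR with the standard definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. It assumes 8-bit pixel values ($\mathrm{MAX} = 255$) and flat lists of pixels; the function name `psnr` and the toy pixel values are illustrative and are not taken from the paper or its code.

```python
import math


def psnr(original, watermarked, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(original) != len(watermarked) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, watermarked)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy data (assumption): every pixel is shifted by 2 gray levels, so MSE = 4.
original = [100, 120, 140, 160]
watermarked = [102, 118, 142, 158]
print(f"PSNR = {psnr(original, watermarked):.2f} dB")
```

Under these assumptions, a uniform perturbation of two gray levels per pixel lands at roughly 42 dB, which suggests the scale of pixel change that a PSNR of 40 to 42 dB corresponds to.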

Abstract

Instruction-driven image editing allows users to quickly edit an image according to text instructions in a forward pass. Nevertheless, malicious users can easily exploit this technique to create fake images, which could cause a crisis of trust and harm the rights of the original image owners. Watermarking is a common solution for tracing such malicious behavior. Unfortunately, instruction-driven image editing can significantly change the watermarked image at the semantic level, making current state-of-the-art watermarking methods ineffective. To remedy this, we propose Robust-Wide, the first robust watermarking methodology against instruction-driven image editing. Specifically, we follow the classic structure of deep robust watermarking, consisting of an encoder, a noise layer, and a decoder. To achieve robustness against semantic distortions, we introduce a novel Partial Instruction-driven Denoising Sampling Guidance (PIDSG) module, which consists of a large variety of instruction injections and substantial modifications of images at different semantic levels. With PIDSG, the encoder tends to embed the watermark into more robust and semantic-aware areas, where it persists even after severe image editing. Experiments demonstrate that Robust-Wide can effectively extract the watermark from the edited image with a low bit error rate of nearly 2.6% for 64-bit watermark messages. Meanwhile, it has only a negligible influence on the visual quality and editability of the original images. Moreover, Robust-Wide exhibits general robustness against different sampling configurations and other popular image editing methods such as ControlNet-InstructPix2Pix, MagicBrush, Inpainting, and DDIM Inversion. Code and models are available at https://github.com/hurunyi/Robust-Wide.
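The bit error rate reported in the abstract is the fraction of message bits that the decoder recovers incorrectly. A minimal sketch of that metric for a 64-bit message follows; the function name `bit_error_rate`, the random message, and the two flipped positions are illustrative assumptions and do not reproduce the authors' evaluation code.

```python
import random


def bit_error_rate(embedded, recovered):
    """Fraction of positions where the recovered bits differ from the embedded bits."""
    if len(embedded) != len(recovered) or not embedded:
        raise ValueError("messages must be non-empty and of equal length")
    errors = sum(a != b for a, b in zip(embedded, recovered))
    return errors / len(embedded)


# Hypothetical 64-bit watermark message.
rng = random.Random(0)
message = [rng.randint(0, 1) for _ in range(64)]

# Stand-in for a decoder output after editing (assumption): two bits are flipped.
decoded = list(message)
for position in (5, 40):
    decoded[position] ^= 1

ber = bit_error_rate(message, decoded)
print(ber)  # 2 errors out of 64 bits, i.e. 2/64 = 0.03125
```

For scale, a BER of about 2.6% on a 64-bit message corresponds to fewer than two wrong bits per message on average (0.026 × 64 ≈ 1.7).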

Paper Structure (18 sections, 8 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: The overall training pipeline of Robust-Wide.
  • Figure 2: Visual results for Robust-Wide. From top to bottom: instructions, original images, normalized residual images, watermarked images, edited images, and the corresponding BERs.
  • Figure 3: Robustness of Robust-Wide against different diffusion sampling configurations.
  • Figure 4: General robustness against other editing methods such as InstructPix2Pix, ControlNet-InstructPix2Pix, MagicBrush, Inpainting, and DDIM Inversion.
  • Figure 5: The influence of continual editing. (a) Some visual examples under continual editing (from left to right). (b) BER increases with more editing rounds. This experiment is conducted on real-world images as mentioned above.
  • ...and 4 more figures