Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal
Haonan An, Guang Hua, Zhengru Fang, Guowen Xu, Susanto Rahardja, Yuguang Fang
TL;DR
This work identifies a vulnerability in box-free watermarking: a gradient-based attacker can exploit the watermark decoder to train a remover that eliminates the watermark. It introduces Decoder Gradient Shield (DGS), a closed-form defense that reorients and rescales the gradient of watermarked queries through a positive definite matrix $P$, yielding the relation $Z^* = -P Z + (P+I)W$ and enabling a protected API that preserves the decoder's function while hindering the training of a watermark remover. The authors provide a detailed threat model, derive the gradient-based attack, and demonstrate through deraining and style transfer experiments that DGS prevents watermark removal without sacrificing output fidelity and remains robust to common post-processing attacks. The findings offer a practical IP protection mechanism for box-free watermarking in image-to-image models and identify avenues for future work on countering reverse engineering of the defense.
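To make the closed-form relation concrete, the following is a minimal NumPy sketch of the mapping $Z^* = -P Z + (P+I)W$, not the authors' implementation. It assumes images are flattened to vectors, and it builds $P$ as an arbitrary symmetric positive definite matrix. The helper names `make_positive_definite` and `dgs_shield` are hypothetical, and the logic that decides whether a query is watermarked is omitted.

```python
import numpy as np


def make_positive_definite(n, scale=0.1, seed=0):
    """Build an illustrative symmetric positive definite matrix (assumption:
    the source only requires P to be positive definite, not this construction)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    return scale * (A @ A.T) / n + 1e-3 * np.eye(n)


def dgs_shield(Z, W, P):
    """Return Z* = -P Z + (P + I) W for flattened decoder output Z and watermark W."""
    I = np.eye(P.shape[0])
    return -P @ Z + (P + I) @ W


n = 16
rng = np.random.default_rng(1)
W = rng.uniform(0.0, 1.0, n)              # flattened watermark (toy data)
Z = W + 0.01 * rng.standard_normal(n)     # decoder output close to W
P = make_positive_definite(n)

Z_star = dgs_shield(Z, W, P)

# Rearranging the relation gives Z* - W = -P (Z - W): the residual with
# respect to the watermark is multiplied by -P.
residual_in = Z - W
residual_out = Z_star - W
```

In this sketch a perfectly decoded watermark ($Z = W$) is returned unchanged, and the returned residual `residual_out` has a negative inner product with `residual_in` because $P$ is positive definite. Both properties follow from the stated relation alone.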
Abstract
The intellectual property of deep image-to-image models can be protected by so-called box-free watermarking. It uses an encoder to embed invisible copyright marks into the model's output images and a decoder to extract them. Prior works have improved watermark robustness by focusing on the design of better watermark encoders. In this paper, we reveal an overlooked vulnerability of the unprotected watermark decoder, which is jointly trained with the encoder and can be exploited to train a watermark removal network. To defend against such an attack, we propose the decoder gradient shield (DGS), a protection layer in the decoder API that prevents gradient-based watermark removal with a closed-form solution. The fundamental idea is inspired by the classical adversarial attack, but is utilized for the first time as a defensive mechanism in box-free model watermarking. We then demonstrate that DGS can reorient and rescale the gradient directions of watermarked queries and prevent the watermark remover's training loss from converging to the level it reaches without DGS, while retaining decoder output image quality. Experimental results verify the effectiveness of the proposed method. The code will be made available upon acceptance.
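The "reorient and rescale" claim can be illustrated with a short chain-rule sketch. It assumes the shielded output takes the form $Z^* = -P Z + (P+I)W$ with a positive definite $P$ that is constant with respect to $Z$, and that the attacker minimizes some differentiable loss $\mathcal{L}$ of the returned output. The symbols $\mathcal{L}$ and $g$ are introduced here for illustration only.

$$
\frac{\partial Z^*}{\partial Z} = -P
\quad\Longrightarrow\quad
\nabla_{Z}\mathcal{L} = -P^{\top} g,
\qquad g := \nabla_{Z^*}\mathcal{L},
$$

$$
\left\langle \nabla_{Z}\mathcal{L},\; g \right\rangle = -\,g^{\top} P\, g < 0
\quad \text{for all } g \neq 0 .
$$

Under these assumptions, the gradient back-propagated to the remover always forms an obtuse angle with $g$ (reorientation). If $P$ is additionally symmetric, its norm is scaled by a factor between the smallest and largest eigenvalues of $P$ (rescaling).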
