Mask Image Watermarking
Runyi Hu, Jie Zhang, Shiqian Zhao, Nils Lukas, Jiwei Li, Qing Guo, Han Qiu, Tianwei Zhang
TL;DR
MaskWM introduces a masking-based framework for image watermarking that enables robust local watermark extraction and precise localization while preserving high visual fidelity. By training with masks that guide where watermarks are embedded and extracted, MaskWM-D achieves global embedding with local extraction, and MaskWM-ED enables true local embedding and robust regional protection. The approach delivers state-of-the-art performance in global and local watermarking, localization accuracy, and scalability to multiple watermarks and longer bit lengths, with significantly reduced training compute (≈20 hours on a single A6000) compared to prior methods like WAM. Its design supports fast fine-tuning for adaptive threats and demonstrates strong robustness to a broad suite of distortions, making it practical for real-world image provenance, integrity, and region-specific watermarking tasks.
Abstract
We present MaskWM, a simple, efficient, and flexible framework for image watermarking. MaskWM has two variants: (1) MaskWM-D, which supports global watermark embedding, watermark localization, and local watermark extraction for applications such as tamper detection; (2) MaskWM-ED, which focuses on local watermark embedding and extraction, offering enhanced robustness in small regions to support fine-grained image protection. MaskWM-D builds on the classical encoder-distortion layer-decoder training paradigm and adds a simple masking mechanism at the decoding stage that enables both global and local watermark extraction. During training, various types of masks are applied to the watermarked images before extraction, so the decoder learns to localize watermarks and extract them from the corresponding local areas. MaskWM-ED extends this design by also incorporating the mask into the encoding stage, guiding the encoder to embed the watermark in designated local regions, which improves robustness under regional attacks. Extensive experiments show that MaskWM achieves state-of-the-art performance in global and local watermark extraction, watermark localization, and multi-watermark embedding. It outperforms all existing baselines, including the recent leading model WAM for local watermarking, while preserving high visual quality of the watermarked images. In addition, MaskWM is highly efficient and adaptable. It requires only 20 hours of training on a single A6000 GPU, making it 15x more computationally efficient than WAM. By simply adjusting the distortion layer, MaskWM can be quickly fine-tuned to meet varying robustness requirements.
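To make the decoding-stage masking step concrete, the following is a minimal NumPy sketch of how a mask might be applied to a watermarked image before it is passed to the decoder. It is an illustration under stated assumptions, not the authors' implementation: the function names `random_box_mask` and `apply_mask` are hypothetical, the mask is a single random rectangle (the abstract only says "various types of masks"), and the choice to fill the masked-out area with the original cover image is an assumption, since the abstract does not specify what replaces those pixels. The encoder output is also replaced by a stand-in (cover plus small noise).

```python
import numpy as np


def random_box_mask(height, width, rng, min_frac=0.25, max_frac=0.75):
    """Return a binary (height, width) mask with one random rectangle set to 1.

    Hypothetical helper: a rectangle is only one possible mask type.
    """
    box_h = max(int(height * rng.uniform(min_frac, max_frac)), 1)
    box_w = max(int(width * rng.uniform(min_frac, max_frac)), 1)
    top = rng.integers(0, height - box_h + 1)   # upper bound is exclusive
    left = rng.integers(0, width - box_w + 1)
    mask = np.zeros((height, width), dtype=np.float32)
    mask[top:top + box_h, left:left + box_w] = 1.0
    return mask


def apply_mask(watermarked, cover, mask):
    """Keep watermarked pixels where mask == 1 and cover pixels elsewhere.

    Assumption: the masked-out region is filled with the unwatermarked cover.
    """
    m = mask[..., None]  # (H, W, 1), broadcasts over the channel axis
    return m * watermarked + (1.0 - m) * cover


rng = np.random.default_rng(0)
cover = rng.random((64, 64, 3), dtype=np.float32)

# Stand-in for an encoder output: the cover image plus a small residual.
residual = 0.01 * rng.standard_normal((64, 64, 3), dtype=np.float32)
watermarked = np.clip(cover + residual, 0.0, 1.0)

mask = random_box_mask(64, 64, rng)
masked_input = apply_mask(watermarked, cover, mask)  # what a decoder would see
```

In a training loop of this kind, `masked_input` would be fed to the decoder, which would be supervised to predict both the mask (localization) and the embedded bits (extraction). Sampling a fresh mask per image is one plausible way to expose the decoder to watermarked regions of varying size and position.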
