Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
Ke Liu, Xuanhan Wang, Qilong Zhang, Lianli Gao, Jingkuan Song
TL;DR
The paper tackles the challenge of learning-based image watermarking that is simultaneously invisible, robust, and broadly applicable. It introduces Hierarchical Watermark Learning (HiWL), a two-stage framework. The first stage, distribution alignment, fuses watermark messages with cover images in a latent space. The second stage, generalized watermark representation learning, uses RGB residuals to enable one-shot embedding across diverse images. Empirical results show that HiWL achieves about 7.6% higher watermark extraction accuracy than prior methods and can process 1000 images in 1 second, while maintaining high invisibility (PSNR around $37.86$ dB, SSIM around $0.969$) and strong robustness under 18 distortion types and cross-domain transfers. The two-stage design provides a scalable, low-latency solution for practical watermarking across datasets and transformation scenarios.
Abstract
Deep image watermarking, which embeds imperceptible watermarks in cover images and reliably extracts them, has been shown to be effective for copyright protection of image assets. However, existing methods struggle to satisfy three essential criteria for generalizable watermarking at the same time: (1) invisibility (imperceptible hiding of watermarks), (2) robustness (reliable watermark recovery under diverse conditions), and (3) broad applicability (low latency in the watermarking process). To address these limitations, we propose Hierarchical Watermark Learning (HiWL), a two-stage optimization framework that enables a watermarking model to achieve all three criteria simultaneously. In the first stage, distribution alignment learning establishes a common latent space under two constraints: (1) visual consistency between watermarked and non-watermarked images, and (2) information invariance across watermark latent representations. In this way, multimodal inputs, including watermark messages (binary codes) and cover images (RGB pixels), can be effectively represented, ensuring both the invisibility of watermarks and robustness in the watermarking process. In the second stage, we employ generalized watermark representation learning to separate a unique representation of the watermark from the marked image in RGB space. Once trained, the HiWL model effectively learns generalizable watermark representations while maintaining broad applicability. Extensive experiments demonstrate the effectiveness of the proposed method. Specifically, it achieves 7.6% higher accuracy in watermark extraction than existing methods, while maintaining extremely low latency (processing 1000 images in 1 second).
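As a rough illustration of why a watermark represented separately in RGB space can be applied with low latency, the sketch below adds one fixed residual to a whole batch of cover images and measures invisibility with PSNR. It is not the HiWL implementation: the residual is random noise standing in for a learned one, no extractor is modeled, and the names `embed_one_shot`, `psnr`, and `bit_accuracy`, along with the image size and the residual amplitude, are assumptions made for this example.

```python
import numpy as np


def embed_one_shot(covers, residual):
    """Add one fixed RGB residual to every cover image.

    covers:   uint8 array of shape (N, H, W, 3)
    residual: float array of shape (H, W, 3), broadcast over the batch
    """
    marked = covers.astype(np.float32) + residual
    # Round and clip back to the valid 8-bit pixel range.
    return np.clip(np.rint(marked), 0, 255).astype(np.uint8)


def psnr(a, b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two image arrays."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def bit_accuracy(sent_bits, decoded_bits):
    """Fraction of message bits recovered correctly."""
    return float(np.mean(np.asarray(sent_bits) == np.asarray(decoded_bits)))


rng = np.random.default_rng(0)
# Hypothetical data: 1000 random 32x32 "cover images".
covers = rng.integers(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
# Placeholder residual (assumption): small random noise, not a learned watermark.
residual = rng.uniform(-3.0, 3.0, size=(32, 32, 3)).astype(np.float32)

marked = embed_one_shot(covers, residual)
print(marked.shape, round(psnr(covers, marked), 2))
```

Because embedding reduces to a single broadcast addition, its cost per image is one array add and does not depend on running an encoder network for each cover. The `bit_accuracy` helper shows how extraction accuracy would be scored once a decoder supplies `decoded_bits`.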
