Text-Guided Image Invariant Feature Learning for Robust Image Watermarking
Muhammad Ahtesham, Xin Zhong
TL;DR
The paper tackles robust image watermarking under transformations by learning invariant features guided by text semantics. It introduces a text-guided invariant feature learning framework that combines CLIP's image and text encoders with a 4096-d projector, trained with a contrastive loss plus a decorrelation term under the objective $L_{total} = L_{pos} + L_{neg} + \lambda_{decorr} L_{decorr}$. Evaluations on Flickr8k, Flickr30k, OxfordPet, and STL10 show higher cosine similarity between original and distorted features and improved watermark extraction accuracy compared with SSL-based methods such as SimCLR, BYOL, and DINO. The results indicate that grounding image representations in textual semantics yields robust watermarking suitable for real-world digital rights management and content authentication.
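As a rough illustration of how the three terms in $L_{total}$ might combine, the NumPy sketch below computes one possible version of the objective. The summary does not define the individual terms, so every form used here is an assumption: $L_{pos}$ is taken as one minus the cosine similarity between a distorted image's feature and its own text anchor, $L_{neg}$ as the mean positive cosine similarity to the other captions in the batch, and $L_{decorr}$ as the mean squared off-diagonal entry of the feature correlation matrix. The function name `total_loss`, the default `lambda_decorr=0.1`, and the 16-d toy features (standing in for the 4096-d projector output) are likewise hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def total_loss(img_feats, txt_feats, lambda_decorr=0.1, eps=1e-8):
    """Assumed form of L_total = L_pos + L_neg + lambda_decorr * L_decorr.

    img_feats: (B, D) projected features of distorted images.
    txt_feats: (B, D) text-anchor features; row i describes image i.
    Requires B >= 2 and D >= 2 so that off-diagonal entries exist.
    """
    z = l2_normalize(img_feats, eps)
    t = l2_normalize(txt_feats, eps)
    sim = z @ t.T  # (B, B) cosine similarities between images and texts
    batch = sim.shape[0]
    off_pairs = ~np.eye(batch, dtype=bool)

    # Assumed positive term: pull each image toward its own text anchor.
    l_pos = float(np.mean(1.0 - np.diag(sim)))
    # Assumed negative term: penalize positive similarity to other captions.
    l_neg = float(np.mean(np.maximum(sim[off_pairs], 0.0)))

    # Assumed decorrelation term: standardize each feature dimension over
    # the batch and penalize off-diagonal entries of the correlation matrix.
    centered = img_feats - img_feats.mean(axis=0)
    standardized = centered / (img_feats.std(axis=0) + eps)
    corr = standardized.T @ standardized / batch  # (D, D)
    off_dims = ~np.eye(corr.shape[0], dtype=bool)
    l_decorr = float(np.mean(corr[off_dims] ** 2))

    total = l_pos + l_neg + lambda_decorr * l_decorr
    return total, {"pos": l_pos, "neg": l_neg, "decorr": l_decorr}


# Toy usage: image features are noisy copies of their text anchors.
rng = np.random.default_rng(0)
txt = rng.normal(size=(8, 16))
img = txt + 0.1 * rng.normal(size=(8, 16))
total, parts = total_loss(img, txt)
print(total, parts)
```

In this sketch the text features act as fixed anchors: only the positive and negative terms involve them, while the decorrelation term depends on the image features alone.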
Abstract
Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.
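The feature consistency test mentioned in the abstract can be pictured as a mean cosine similarity between features of original images and features of their distorted counterparts. The sketch below is a minimal version under that assumption; the function name `feature_consistency` and the toy arrays are hypothetical, and the actual evaluation protocol may differ.

```python
import numpy as np


def feature_consistency(feats_orig, feats_dist, eps=1e-8):
    """Mean row-wise cosine similarity between two (N, D) feature arrays.

    Row i of feats_orig and row i of feats_dist are assumed to come from
    the same image before and after a distortion.
    """
    a = feats_orig / (np.linalg.norm(feats_orig, axis=1, keepdims=True) + eps)
    b = feats_dist / (np.linalg.norm(feats_dist, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(a * b, axis=1)))


# Toy usage: a mild perturbation should score closer to 1 than a strong one.
rng = np.random.default_rng(0)
orig = rng.normal(size=(32, 64))
mild = orig + 0.05 * rng.normal(size=orig.shape)
strong = orig + 2.0 * rng.normal(size=orig.shape)
print(feature_consistency(orig, mild), feature_consistency(orig, strong))
```

A score near 1 means the extractor maps an image and its distorted version to nearly the same direction in feature space, which is the property an invariant watermarking backbone is meant to have.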
