Text-Guided Image Invariant Feature Learning for Robust Image Watermarking

Muhammad Ahtesham, Xin Zhong

TL;DR

The paper tackles robust image watermarking under transformations by learning invariant features guided by text semantics. It introduces a text-guided invariant feature learning framework that leverages CLIP's image/text encoders and a 4096-d projector, optimized with a contrastive loss plus a decorrelation term, enforcing the objective $L_{total} = L_{pos} + L_{neg} + \lambda_{decorr} L_{decorr}$. Evaluations on Flickr8k, Flickr30k, OxfordPet, and STL10 show higher cosine similarity between original and distorted features and improved watermark extraction accuracy compared with SSL-based methods like SimCLR, BYOL, and DINO. The results indicate that grounding image representations in textual semantics yields robust watermarking suitable for real-world digital rights management and content authentication.
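As a rough illustration of how an objective of the form $L_{total} = L_{pos} + L_{neg} + \lambda_{decorr} L_{decorr}$ could be assembled, here is a minimal NumPy sketch. The summary above does not define the individual terms, so everything in it is an assumption: `total_loss`, `l2_normalize`, and `lambda_decorr=0.01` are hypothetical names and values; the positive term is taken as one minus the cosine similarity between each image feature and its own text anchor, the negative term penalizes positive similarity to the other text anchors, and the decorrelation term is the squared off-diagonal mass of the feature correlation matrix. The paper's actual definitions may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def total_loss(img_feats, txt_feats, lambda_decorr=0.01):
    """Assumed form of L_pos + L_neg + lambda_decorr * L_decorr.

    img_feats: (N, D) projected features of (distorted) images, N >= 2.
    txt_feats: (N, D) text-anchor features; row i is the caption of image i.
    """
    z = l2_normalize(img_feats)
    t = l2_normalize(txt_feats)
    sim = z @ t.T                      # (N, N) cosine similarities
    n = sim.shape[0]
    off_diag = ~np.eye(n, dtype=bool)

    # Assumed positive term: pull each image toward its own text anchor.
    l_pos = np.mean(1.0 - np.diag(sim))

    # Assumed negative term: penalize positive similarity to other anchors.
    l_neg = np.mean(np.maximum(sim[off_diag], 0.0))

    # Assumed decorrelation term: squared off-diagonal entries of the
    # correlation matrix of the image features, averaged per dimension.
    zc = img_feats - img_feats.mean(axis=0, keepdims=True)
    zc = zc / (zc.std(axis=0, keepdims=True) + 1e-8)
    corr = (zc.T @ zc) / n             # (D, D)
    d = corr.shape[0]
    l_decorr = np.sum(corr[~np.eye(d, dtype=bool)] ** 2) / d

    total = l_pos + l_neg + lambda_decorr * l_decorr
    return total, (l_pos, l_neg, l_decorr)
```

In this sketch, image features that coincide with their text anchors and are orthogonal to all other anchors drive the first two terms to zero, leaving only the weighted decorrelation penalty.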

Abstract

Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.
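A feature consistency test of the kind mentioned above can be pictured as the mean cosine similarity between features of the original images and features of their distorted versions. The sketch below is only illustrative: `toy_extractor` is a hypothetical stand-in (4x4 average pooling) for a learned encoder, and the random images and Gaussian noise with standard deviation 0.1 are made-up test inputs, not the datasets or distortions used in the paper.

```python
import numpy as np

def mean_cosine_similarity(feats_a, feats_b, eps=1e-8):
    """Average row-wise cosine similarity between two (N, D) feature arrays."""
    a = feats_a / (np.linalg.norm(feats_a, axis=1, keepdims=True) + eps)
    b = feats_b / (np.linalg.norm(feats_b, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(a * b, axis=1)))

def toy_extractor(images):
    """Hypothetical encoder stand-in: 4x4 average pooling, then flatten.

    images: (N, H, W) array with H and W divisible by 4.
    """
    n, h, w = images.shape
    pooled = images.reshape(n, h // 4, 4, w // 4, 4).mean(axis=(2, 4))
    return pooled.reshape(n, -1)

# Made-up data: 8 random grayscale "images" and noisy copies of them.
rng = np.random.default_rng(0)
originals = rng.random((8, 32, 32))
distorted = np.clip(originals + rng.normal(0.0, 0.1, originals.shape), 0.0, 1.0)

score = mean_cosine_similarity(toy_extractor(originals), toy_extractor(distorted))
print(f"mean cosine similarity under noise: {score:.4f}")
```

A score close to 1 means the extractor's features barely move under the distortion; comparing this score across encoders is one plausible way to read the "higher cosine similarity" claim.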

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: The proposed text-guided invariant feature learning.
  • Figure 2: Example tolerance of the proposed method to different noise levels.
  • Figure 3: Distortions used in our watermarking analysis.