TIACam: Text-Anchored Invariant Feature Learning with Auto-Augmentation for Camera-Robust Zero-Watermarking

Abdullah All Tanvir, Agnibh Dasgupta, Xin Zhong

TL;DR

TIACam is a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. It connects multimodal invariance learning with physically robust zero-watermarking.

Abstract

Camera recapture introduces complex optical degradations, such as perspective warping, illumination shifts, and Moiré interference, that remain challenging for deep watermarking systems. We present TIACam, a text-anchored invariant feature learning framework with auto-augmentation for camera-robust zero-watermarking. The method integrates three key innovations: (1) a learnable auto-augmentor that discovers camera-like distortions through differentiable geometric, photometric, and Moiré operators; (2) a text-anchored invariant feature learner that enforces semantic consistency via cross-modal adversarial alignment between image and text; and (3) a zero-watermarking head that binds binary messages in the invariant feature space without modifying image pixels. This unified formulation jointly optimizes invariance, semantic alignment, and watermark recoverability. Extensive experiments on both synthetic and real-world camera captures demonstrate that TIACam achieves state-of-the-art feature stability and watermark extraction accuracy, establishing a principled bridge between multimodal invariance learning and physically robust zero-watermarking.

Paper Structure

This paper contains 36 sections, 8 equations, 11 figures, and 11 tables.

Figures (11)

  • Figure 1: Concept of the proposed TIACam.
  • Figure 1: Example of Visual Genome Training Data used in TIACam.
  • Figure 2: Overview of the proposed TIACam. Given an input image $x$ and its positive anchor caption $T$ with a negative caption $\tilde{T}$, a distorted image $\hat{x} = \mathcal{T}_{\text{aug}}(x)$ is generated using the learned auto-augmentor $\mathcal{T}_{\text{aug}}(\cdot)$. All inputs are encoded by the CLIP encoders to obtain 768-D features, which are refined by the invariant feature extractor $f_{\theta}(\cdot)$ into 1024-D invariant representations. Paired samples $(f_{\theta}(x), g_{\tau}(T))$, $(f_{\theta}(\hat{x}), g_{\tau}(T))$, and $(f_{\theta}(\hat{x}), g_{\tau}(\tilde{T}))$ are used to train a discriminator $D_{\psi}(\cdot)$ that distinguishes real from fake associations, while $f_{\theta}$ is adversarially optimized both against $D_{\psi}$ for semantic alignment and against $\mathcal{T}_{\text{aug}}$ for robustness. For zero-watermarking, the invariant feature $f_{\theta}(x)$ is projected onto reference codes $C$, and watermark bits are predicted as $\hat{W} = \sigma(f_{\theta}(x)^{\top} C)$ for reliable extraction.
  • Figure 2: Example of Flickr training data used in TIACam. The first row shows the images, the second row contains the corresponding captions, and the third row presents paraphrased versions of those captions.
  • Figure 3: From left to right: example image, its camera-distorted version, and an unrelated negative image.
  • ...and 6 more figures
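
The extraction rule in the Figure 2 caption, $\hat{W} = \sigma(f_{\theta}(x)^{\top} C)$, can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: a random unit vector stands in for the 1024-D invariant feature $f_{\theta}(x)$, the 32-bit message length is arbitrary, and the way the reference codes $C$ are fitted here (plain gradient descent on a binary cross-entropy loss) is an assumption, since this section does not say how $C$ is obtained. The function names `fit_reference_codes` and `extract_bits` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_reference_codes(feature, bits, steps=500, lr=0.5):
    """Fit codes C (d x k) so that sigmoid(feature @ C) approximates bits.

    Assumption: gradient descent on binary cross-entropy; the image itself
    is never modified, only C is learned.
    """
    d, k = feature.shape[0], bits.shape[0]
    C = np.zeros((d, k))
    for _ in range(steps):
        pred = sigmoid(feature @ C)             # predicted bits, shape (k,)
        grad = np.outer(feature, pred - bits)   # BCE gradient w.r.t. C
        C -= lr * grad
    return C


def extract_bits(feature, C):
    """Apply the caption's rule W_hat = sigma(f^T C), thresholded at 0.5."""
    return (sigmoid(feature @ C) > 0.5).astype(int)


rng = np.random.default_rng(0)

# Stand-in for the invariant feature f_theta(x); a real system would use
# the output of the trained feature extractor instead.
feature = rng.standard_normal(1024)
feature /= np.linalg.norm(feature)

bits = rng.integers(0, 2, size=32)              # binary message W
C = fit_reference_codes(feature, bits)

# A slightly perturbed feature mimics a feature that stayed nearly
# invariant under a camera-like distortion.
noisy = feature + 0.01 * rng.standard_normal(1024)
noisy /= np.linalg.norm(noisy)
recovered = extract_bits(noisy, C)

print("bit accuracy:", (recovered == bits).mean())
```

The sketch makes the dependence on invariance explicit: the message is recoverable only as long as the distorted image's feature stays close to the feature the codes were fitted on, which is why the framework trains $f_{\theta}$ against the auto-augmentor.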