Robust Invisible Video Watermarking with Attention

Kevin Alex Zhang, Lei Xu, Alfredo Cuesta-Infante, Kalyan Veeramachaneni

TL;DR

The paper tackles robust, invisible video watermarking by introducing RivaGAN, an end-to-end architecture that uses a per-pixel attention mechanism to embed a D-bit watermark into video frames and is trained jointly with a critic and an adversary to preserve video quality and make the watermark robust. It augments the encoder–decoder pair with an attention module that guides bit embedding at the pixel level, and employs differentiable noise layers simulating scaling, cropping, and compression to enforce resilience. In experiments on the Hollywood2 dataset, the method achieves high decoding accuracy with minimal perceptual distortion, outperforming concatenation-based baselines and remaining resilient to common video processing operations. These results, along with analyses of bit-level influence and temporal consistency, indicate practical viability for secure, blind watermark recovery, and the authors provide public code for replication.

Abstract

The goal of video watermarking is to embed a message within a video file in a way such that it minimally impacts the viewing experience but can be recovered even if the video is redistributed and modified, allowing media producers to assert ownership over their content. This paper presents RivaGAN, a novel architecture for robust video watermarking which features a custom attention-based mechanism for embedding arbitrary data as well as two independent adversarial networks which critique the video quality and optimize for robustness. Using this technique, we are able to achieve state-of-the-art results in deep learning-based video watermarking and produce watermarked videos which have minimal visual distortion and are robust against common video processing operations.

Paper Structure

This paper contains 9 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: This figure shows the difference between what related deep learning-based approaches (left) to this task use to represent their data and what our attention-based approach (right) uses. Unlike existing approaches which naively repeat the data across the spatial dimensions, we learn a probability distribution over the data for each pixel (i.e. the attention distribution) and use that to generate a more compact data representation. This operation also has the advantage of being interpretable as an "attention mask" as we can see what bits each pixel is paying attention to and encourage the model to pay attention to different bits based on the content of the image.
  • Figure 2: This figure shows how the attention, encoder, and decoder modules operate on a tensor level. The attention module uses two convolutional blocks to create an attention mask, which is then used by the encoder and decoder modules to determine which bits to pay attention to at each pixel. The encoder module uses the attention mask to compute a compacted form of the data tensor and concatenates it to the image before applying additional convolutional blocks to generate the watermarked video. The decoder module extracts the data from each pixel but then weights the prediction using the attention mask before averaging to try and recover the original data.
  • Figure 3: This figure shows the watermarked video (top) and the residual masks (bottom). The residual masks were generated by the encoder module and added to the source video to produce the watermarked video.
  • Figure 4: This figure shows the original source video and two example "difference masks" for the first and second bit of the data tensor. Bright regions indicate that flipping a single bit caused that pixel to change in the watermarked output. The three images on the top correspond to a model trained with the attention mechanism, and we note that the two difference masks look significantly different. The three images on the bottom correspond to a model trained without the attention mechanism, and the two difference masks are virtually identical.
  • Figure 5: This figure shows the training loss for the same model architecture, learning rate, and optimizer but trained with and without the bit inverse trick. We find that including the bit inverse within the same batch results in dramatically faster convergence as well as better model performance.
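
The contrast drawn in the captions for Figures 1 and 2 (repeating all D bits at every pixel versus learning a per-pixel distribution over the bits) can be illustrated with a small sketch. This is not the authors' implementation: the random `scores` tensor is a hypothetical stand-in for the output of the attention module's convolutional blocks, and the helper names (`naive_repeat`, `attention_mask`, `attend`) are invented for illustration. It only shows how a softmax over the bit axis collapses D channels into one attended value per pixel.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def naive_repeat(bits, height, width):
    """Baseline: copy all D bits to every pixel (D channels per pixel)."""
    return [[list(bits) for _ in range(width)] for _ in range(height)]

def attention_mask(scores):
    """Turn per-pixel scores (H x W x D) into a distribution over the D bits."""
    return [[softmax(pixel) for pixel in row] for row in scores]

def attend(mask, bits):
    """Compact representation: one attention-weighted scalar per pixel."""
    return [[sum(a * b for a, b in zip(pixel, bits)) for pixel in row]
            for row in mask]

# Hypothetical stand-in for what the convolutional blocks would produce.
rng = random.Random(0)
H, W, D = 2, 3, 4
scores = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(W)]
          for _ in range(H)]
bits = [1, 0, 1, 1]

mask = attention_mask(scores)        # H x W x D, each pixel sums to 1
compact = attend(mask, bits)         # H x W, a single channel
repeated = naive_repeat(bits, H, W)  # H x W x D, D channels

print(len(repeated[0][0]), "channels per pixel with naive repetition")
print("1 channel per pixel with attention:",
      [[round(v, 3) for v in row] for row in compact])
```

Because each pixel's mask is an explicit distribution over bit positions, it can be inspected directly, which is the interpretability property the Figure 1 caption points to.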
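
The "bit inverse trick" from the Figure 5 caption can be sketched as a batch-construction step. The caption states only that the bit inverse is included within the same batch, so everything else here is an assumption: the function name `with_bit_inverse` is hypothetical, and how the inverted messages are paired with video frames is not covered.

```python
def with_bit_inverse(messages):
    """Return the batch extended with the bitwise complement of each message."""
    inverses = [[1 - bit for bit in message] for message in messages]
    return messages + inverses

batch = [[0, 1, 1], [1, 0, 0], [1, 1, 1]]
augmented = with_bit_inverse(batch)

# Every bit position is now exactly balanced between 0 and 1 within the batch.
ones_per_position = [sum(column) for column in zip(*augmented)]
print(augmented)
print(ones_per_position)
```

One plausible reading of why this helps, offered as an interpretation rather than a claim from the paper, is that each batch then contains equal numbers of zeros and ones at every bit position, so no position can be fit by a constant prediction.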