Video Seal: Open and Efficient Video Watermarking

Pierre Fernandez, Hady Elsahar, I. Zeki Yalniz, Alexandre Mourachko

TL;DR

Video Seal introduces an open-source, efficient neural video watermarking framework that leverages temporal watermark propagation to avoid watermarking every frame. It jointly trains a lightweight embedder and a ViT-based extractor, supported by multistage training on images and videos with differentiable data augmentations including codecs. The approach demonstrates superior robustness under complex distortions while preserving high perceptual quality, and provides extensive ablations and open artifacts to advance reproducibility. The work highlights practical implications for watermarking AI-generated content and offers a solid foundation for future improvements in payload, robustness, and perceptual fidelity.

Abstract

The proliferation of AI-generated content and sophisticated video editing tools has made it both important and challenging to moderate digital platforms. Video watermarking addresses these challenges by embedding imperceptible signals into videos, allowing for identification. However, the rare open tools and methods often fall short on efficiency, robustness, and flexibility. To reduce these gaps, this paper introduces Video Seal, a comprehensive framework for neural video watermarking and a competitive open-sourced model. Our approach jointly trains an embedder and an extractor, while ensuring the watermark robustness by applying transformations in-between, e.g., video codecs. This training is multistage and includes image pre-training, hybrid post-training and extractor fine-tuning. We also introduce temporal watermark propagation, a technique to convert any image watermarking model to an efficient video watermarking model without the need to watermark every high-resolution frame. We present experimental results demonstrating the effectiveness of the approach in terms of speed, imperceptibility, and robustness. Video Seal achieves higher robustness compared to strong baselines especially under challenging distortions combining geometric transformations and video compression. Additionally, we provide new insights such as the impact of video compression during training, and how to compare methods operating on different payloads. Contributions in this work - including the codebase, models, and a public demo - are open-sourced under permissive licenses to foster further research and development in the field.

Paper Structure

This paper contains 53 sections, 14 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Overview of digital video watermarking. A binary message is embedded into an original video (e.g., an AI-generated video), producing an imperceptible change in the pixels. This watermarked video may be compressed or edited when saved or shared online. Despite these transformations, the watermark extraction process should retrieve the embedded message. The two primary challenges in this process are (1) the speed of embedding and extraction, which must be computationally efficient to handle the large number of frames in a video, and (2) robustness to common video codecs that often degrade the watermark to the point of being undetectable.
  • Figure 2: Illustration of the embedding process for video watermarking including temporal watermark propagation. To minimize computational overhead, the embedder processes every $k$ frames of the video independently, producing a watermark signal that is copied along the temporal axis to the $k$ neighboring frames. Additionally, the embedding is performed on a downscaled version of the video and the watermark is later upscaled to match the original resolution. This approach helps balance efficiency and robustness.
  • Figure 3: Detailed optimization pipeline of Video Seal. The embedder takes a batch of input images or a sequence of video frames $x$ and random binary messages $m$, and outputs a batch of watermarked images or frames $x_w$. Differentiable transformations are randomly applied to $x_w$ to simulate real-world transmissions, such as crops, brightness changes, or video compression. The extractor then processes these transformed images to estimate the embedded messages $\tilde{m}$. The watermark embedder and extractor are trained jointly to minimize two objectives: the message reconstruction loss and the mean squared error (MSE) between the original images $x$ and the watermarked images $x_w$. Additionally, they are trained to maximize the adversarial loss against a quality discriminator. In a separate optimization step, the quality discriminator $D_q$ itself is trained to distinguish between the watermarked and original images, while keeping the embedder and extractor parameters fixed.
  • Figure 4: Examples of transformations used for robustness evaluation, e.g., in Fig. \ref{fig:trade-off} (we show the 20th frame of a 10-second video). We choose H.264 (CRF=30), crop (50% area-wise), brightness with factor 0.5, as representative of video compression codecs, geometric transformations and valuemetric transformations, respectively.
  • Figure 5: Qualitative results for different watermarking methods. Images are from the SA-1b dataset at their original resolution ($\approx$2k $\times$ 1k), and we show more examples in App. \ref{app:qualitative}. Although watermarks are imperceptible at first glance, most are visible under close inspection, especially in the flat areas, like the skies in both images. They are also of very different nature between the methods.
  • ...and 7 more figures
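
The temporal watermark propagation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the released Video Seal code: the function names, the list-of-lists grayscale frame representation, the nearest-neighbour resizing, the `strength` factor, and the checkerboard `toy_embedder` are all assumptions made for illustration. Only the overall flow follows the caption: run the embedder on a downscaled copy of one frame out of every $k$, upscale the resulting signal, and reuse it for the frames in between.

```python
from typing import Callable, List, Tuple

Frame = List[List[float]]  # one grayscale frame as rows of pixels (simplifying assumption)


def resize_nearest(img: Frame, height: int, width: int) -> Frame:
    """Nearest-neighbour resize; a stand-in for whatever interpolation is really used."""
    h, w = len(img), len(img[0])
    return [[img[i * h // height][j * w // width] for j in range(width)]
            for i in range(height)]


def propagate_watermark(video: List[Frame],
                        embedder: Callable[[Frame], Frame],
                        k: int,
                        low_res: Tuple[int, int],
                        strength: float = 1.0) -> List[Frame]:
    """Run the embedder on one frame out of every k and reuse its signal in between."""
    height, width = len(video[0]), len(video[0][0])
    signal: Frame = []
    watermarked = []
    for t, frame in enumerate(video):
        if t % k == 0:
            # Embed on a downscaled copy, then upscale the signal to full resolution.
            small = resize_nearest(frame, *low_res)
            signal = resize_nearest(embedder(small), height, width)
        # Frames between key frames reuse the most recent signal unchanged.
        watermarked.append([[p + strength * s for p, s in zip(row, sig_row)]
                            for row, sig_row in zip(frame, signal)])
    return watermarked


calls = 0


def toy_embedder(frame: Frame) -> Frame:
    """Hypothetical stand-in for a neural embedder: a fixed checkerboard of +1/-1."""
    global calls
    calls += 1
    return [[1.0 if (i + j) % 2 == 0 else -1.0 for j in range(len(frame[0]))]
            for i in range(len(frame))]


video = [[[0.0] * 4 for _ in range(4)] for _ in range(5)]  # 5 blank 4x4 frames
result = propagate_watermark(video, toy_embedder, k=2, low_res=(2, 2))
print(calls)  # the embedder ran on frames 0, 2 and 4 only
```

In this sketch the number of embedder calls drops by roughly a factor of $k$, and each call sees only `low_res` pixels, which is where the efficiency gain would come from. Larger $k$ saves more compute but reuses a signal that is less adapted to the frames it is copied onto.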