Video Signature: Implicit Watermarking for Video Diffusion Models

Yu Huang, Junhao Chen, Shuliang Liu, Hanqian Li, Jungang Li, Qi Zheng, Aiwei Liu, Yi R. Fung, Xuming Hu

TL;DR

VidSig introduces implicit watermarking for video diffusion models by fine-tuning a subset of the latent decoder to embed multibit watermarks during video generation. It combines Perturbation-Aware Suppression (PAS), which pre-identifies perceptually sensitive layers, with a Temporal Alignment (TA) module that enforces inter-frame coherence, achieving high watermark extraction accuracy with minimal perceptual loss. The method outperforms post-generation baselines and naively extended image-based in-generation approaches in both watermark reliability and video quality. It also adds little latency, resists tampering, remains stable across different frame counts and resolutions, and transfers to new models. In practice, VidSig provides a scalable, plug-in solution for ownership verification and provenance tracking of AI-generated videos in real-world deployment.

Abstract

The rapid development of Artificial Intelligence Generated Content (AIGC) has led to significant progress in video generation, but also raises serious concerns about intellectual property protection and reliable content tracing. Watermarking is a widely adopted solution to this issue, yet existing methods for video generation mainly follow a post-generation paradigm, which often fails to balance video quality against watermark extraction accuracy. Meanwhile, current in-generation methods that embed the watermark into the initial Gaussian noise usually incur substantial additional computation. To address these issues, we propose Video Signature (VidSig), an implicit watermarking method for video diffusion models that enables imperceptible and adaptive watermark integration during video generation with almost no extra latency. Specifically, we partially fine-tune the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality. Beyond spatial fidelity, we further enhance temporal consistency by introducing a lightweight Temporal Alignment module that guides the decoder to generate coherent frame sequences during fine-tuning. Experimental results show that VidSig achieves the best trade-off among watermark extraction accuracy, video quality, and watermark latency. It also demonstrates strong robustness against both spatial and temporal tampering, and remains stable across different video lengths and resolutions, highlighting its practicality in real-world scenarios.
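
The abstract says PAS pre-identifies perceptually sensitive layers and freezes them, but it does not give the scoring rule. The toy sketch below is one possible reading, not the authors' procedure: it assumes sensitivity is measured as the mean squared change in the decoder output when a single layer's weight is nudged by a fixed amount, and it stands in a chain of scalar affine layers for the real latent decoder. The names `decode`, `layer_sensitivity`, `select_frozen`, and the parameter `delta` are all hypothetical.

```python
def decode(layers, z):
    """Toy stand-in for a latent decoder: a chain of scalar affine layers."""
    x = list(z)
    for w, b in layers:
        x = [w * v + b for v in x]
    return x


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def layer_sensitivity(layers, latents, delta=0.01):
    """Score each layer by how much the output moves when only that
    layer's weight is shifted by `delta` (assumed scoring rule)."""
    scores = []
    for i, (w, b) in enumerate(layers):
        perturbed = list(layers)
        perturbed[i] = (w + delta, b)
        total = sum(mse(decode(layers, z), decode(perturbed, z)) for z in latents)
        scores.append(total / len(latents))
    return scores


def select_frozen(scores, k):
    """Indices of the k most sensitive layers; these would be frozen."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


# Hypothetical three-layer decoder and two latent vectors.
layers = [(0.1, 0.0), (10.0, 0.0), (1.0, 0.0)]
latents = [[1.0, 2.0], [0.5, -1.0]]
scores = layer_sensitivity(layers, latents)
frozen = select_frozen(scores, k=1)
trainable = [i for i in range(len(layers)) if i not in frozen]
```

In this toy setup the first layer is frozen because its perturbation is amplified by every later layer, leaving the remaining layers free to be fine-tuned for watermark embedding. A real implementation would score layers with a perceptual metric on decoded frames rather than raw MSE on scalars.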

Paper Structure

This paper contains 47 sections, 18 equations, 13 figures, 9 tables, 1 algorithm.

Figures (13)

  • Figure 1: Comparison between Video Signature and other watermarking methods in watermark extraction accuracy, video quality (measured by VBench; Huang et al., 2024), and watermark latency. The watermark latency is the sum of the latencies for watermark embedding and extraction. Metric values are averaged over the T2V and I2V models; see the experiments section for details.
  • Figure 2: The training pipeline of Video Signature. (1) Given an input video, we first encode it into a latent representation and decode it with a frozen latent decoder. (2) Before optimization, the proposed PAS module searches for the most perceptually sensitive layers and freezes them. (3) The watermarked decoder $\mathcal{D}'$ is then optimized to embed a secret key into the generated video with three objectives: pixel-level alignment, inter-frame alignment, and bit matching.
  • Figure 3: Watermark detection of Video Signature.
  • Figure 4: Bit accuracy under different spatial tampering. The attack is applied to each frame.
  • Figure 5: Extraction accuracy and video quality versus number of frames and resolution. Resolution $= N$ denotes a resolution of $N \times N$.
  • ...and 8 more figures
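
Several of the figures above report bit accuracy, the fraction of extracted watermark bits that match the embedded key. The sketch below shows how such a number could be computed. It rests on assumptions not stated in this summary: that bits are extracted per frame and fused by majority vote, and that a detection threshold is chosen from the chance-level binomial tail, as is common for multibit watermarks. The function names and the 4-bit key are illustrative only.

```python
from math import comb


def majority_vote(per_frame_bits):
    """Fuse per-frame bit estimates into one message, position by position."""
    n_frames = len(per_frame_bits)
    n_bits = len(per_frame_bits[0])
    fused = []
    for j in range(n_bits):
        ones = sum(frame[j] for frame in per_frame_bits)
        fused.append(1 if 2 * ones > n_frames else 0)
    return fused


def bit_accuracy(extracted, key):
    """Fraction of positions where the extracted message equals the key."""
    if len(extracted) != len(key):
        raise ValueError("extracted message and key must have equal length")
    return sum(a == b for a, b in zip(extracted, key)) / len(key)


def false_positive_rate(n_bits, min_matches):
    """Probability that an unwatermarked video matches at least
    `min_matches` of `n_bits` by chance, assuming independent fair bits."""
    tail = sum(comb(n_bits, m) for m in range(min_matches, n_bits + 1))
    return tail / 2 ** n_bits


# Hypothetical 4-bit key and bits extracted from three frames.
key = [1, 0, 1, 0]
frames = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 0, 1, 1]]
message = majority_vote(frames)
accuracy = bit_accuracy(message, key)
```

The binomial tail gives a principled way to pick the match threshold: requiring more matching bits lowers the chance that unwatermarked content is flagged, at the cost of tolerating fewer bit errors after tampering.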