SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models

Yang Yang, Xinze Zou, Zehua Ma, Han Fang, Weiming Zhang

TL;DR

SKeDA is a generative watermarking framework tailored for text-to-video diffusion models that transforms watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation, substantially improving robustness against frame reordering and loss.

Abstract

The rise of text-to-video generation models has raised growing concerns over content authenticity, copyright protection, and malicious misuse. Watermarking serves as an effective mechanism for regulating such AI-generated content, where high fidelity and strong robustness are particularly critical. Recent generative image watermarking methods provide a promising foundation by leveraging watermark information and pseudo-random keys to control the initial sampling noise, enabling lossless embedding. However, directly extending these techniques to videos introduces two key limitations: (1) existing designs implicitly rely on strict alignment between video frames and the frame-dependent pseudo-random binary sequences used for watermark encryption, so once this alignment is disrupted, subsequent watermark extraction becomes unreliable; and (2) video-specific distortions, such as inter-frame compression, significantly degrade watermark reliability. To address these issues, we propose SKeDA, a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation, transforming watermark extraction from synchronization-sensitive sequence decoding into permutation-tolerant set-level aggregation and substantially improving robustness against frame reordering and loss; and (2) Differential Attention (DA), which computes inter-frame differences and dynamically adjusts attention weights during extraction, enhancing robustness against temporal distortions. Extensive experiments demonstrate that SKeDA preserves high video generation quality while maintaining watermark robustness.
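
The idea behind SKe of deriving every frame-level key from one base sequence by permutation, and then extracting without relying on frame order, can be illustrated with a toy bit-level sketch. This is not the authors' implementation: the function names (`frame_keys`, `encrypt`, `extract`), the repetition coding, the replica-agreement score used to pick a key, and all parameter values are assumptions made for illustration. The sketch also skips the diffusion model entirely and treats each frame as a plain bit vector, whereas in the paper the encrypted bits control the initial sampling noise.

```python
import numpy as np

def frame_keys(base_key, num_frames, seed):
    """Derive one key per frame by permuting a single base key (assumed scheme)."""
    rng = np.random.default_rng(seed)
    return np.stack([base_key[rng.permutation(base_key.size)]
                     for _ in range(num_frames)])

def encrypt(message, keys, reps):
    """Repeat the message `reps` times and XOR it with every frame key."""
    spread = np.tile(message, reps)      # shape: (reps * len(message),)
    return spread[None, :] ^ keys        # shape: (num_frames, reps * len(message))

def extract(received, keys, msg_len, reps):
    """Order-agnostic extraction: each frame picks its best key, then all frames vote."""
    votes = np.zeros(msg_len)
    for bits in received:
        # Decrypt with every candidate key and map bits {0, 1} to {-1, +1}.
        cand = 2.0 * (bits[None, :] ^ keys) - 1.0
        # Sum the replicas of each message bit: shape (num_keys, msg_len).
        soft = cand.reshape(len(keys), reps, msg_len).sum(axis=1)
        # Under the right key the replicas agree, so |soft| is large.
        best = np.abs(soft).sum(axis=1).argmax()
        votes += soft[best]              # set-level aggregation over frames
    return (votes > 0).astype(np.uint8)

rng = np.random.default_rng(0)
msg_len, reps, num_frames = 32, 8, 16    # illustrative sizes, not from the paper
message = rng.integers(0, 2, msg_len, dtype=np.uint8)
base_key = rng.integers(0, 2, msg_len * reps, dtype=np.uint8)
keys = frame_keys(base_key, num_frames, seed=1)
frames = encrypt(message, keys, reps)

# Simulate temporal attacks: shuffle the frame order, drop 4 frames, flip 10% of bits.
kept = rng.permutation(num_frames)[:12]
flips = (rng.random((kept.size, msg_len * reps)) < 0.1).astype(np.uint8)
recovered = extract(frames[kept] ^ flips, keys, msg_len, reps)
print("message recovered:", np.array_equal(recovered, message))
```

Because `extract` never uses a frame's position, reordering or dropping frames only changes which votes are summed, not how each one is decoded. A decoder that pairs the i-th received frame with the i-th key would instead break as soon as the order changes.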

Paper Structure (23 sections, 8 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Application scenarios. The proposed method can identify which model generated a video, as well as trace which user created it.
  • Figure 2: The framework of SKeDA. Our method consists of four main stages: encryption, embedding, extraction, and decryption. In the encryption and embedding stages, the SKe module uses a random shuffle key to distribute and rearrange the watermark information in the latent space, embedding it imperceptibly without affecting video quality. In the extraction and decryption stages, the DA module adaptively assigns weights based on inter-frame differences to improve the robustness and retrieval accuracy of the watermark under various distortions.
  • Figure 3: A single video frame under different distortions. (a) Watermarked frame. (b) 50% Random Crop. (c) Brightness, factor=4. (d) Gaussian Blur, std=2.0. (e) Gaussian Noise, std=0.04. (f) H.264, CRF=30.
  • Figure 4: True positive rate (TPR) and bit accuracy under various video distortions and distortion strengths.
  • Figure 5: TPR and bit accuracy under various image distortions and distortion strengths.
  • ...and 1 more figures