SIGMark: Scalable In-Generation Watermark with Blind Extraction for Video Diffusion

Xinjie Zhu; Zijing Zhao; Hui Jin; Qingxiao Guo; Yilong Ma; Yunhao Wang; Xiaobing Guo; Weifeng Zhang

TL;DR

SIGMark is a scalable in-generation watermarking framework with blind extraction for video diffusion. It achieves very high bit accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness.

Abstract

Artificial Intelligence Generated Content (AIGC), particularly video generation with diffusion models, has advanced rapidly. Invisible watermarking is a key technology for protecting AI-generated videos and tracing harmful content, and thus plays a crucial role in AI safety. In contrast to post-processing watermarks, which inevitably degrade video quality, recent studies have proposed distortion-free in-generation watermarking for video diffusion models. However, existing in-generation approaches are non-blind: they require maintaining all the message-key pairs and performing template-based matching during extraction, which incurs prohibitive computational costs at scale. Moreover, when applied to modern video diffusion models with causal 3D Variational Autoencoders (VAEs), their robustness against temporal disturbance becomes extremely weak. To overcome these challenges, we propose SIGMark, a Scalable In-Generation watermarking framework with blind extraction for video diffusion. To achieve blind extraction, we generate watermarked initial noise using a Global set of Frame-wise PseudoRandom Coding keys (GF-PRC), reducing the cost of storing large-scale information while preserving noise distribution and diversity for distortion-free watermarking. To enhance robustness, we further design a Segment Group-Ordering module (SGO) tailored to causal 3D VAEs, ensuring robust watermark inversion during extraction under temporal disturbance. Comprehensive experiments on modern diffusion models show that SIGMark achieves very high bit accuracy during extraction under both temporal and spatial disturbances with minimal overhead, demonstrating its scalability and robustness. Our project is available at https://jeremyzhao1998.github.io/SIGMark-release/.
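To make the idea of blind extraction from watermarked initial noise concrete, the following is a minimal sketch and not the authors' GF-PRC scheme. It assumes a simplified stand-in for pseudorandom coding: each latent frame's message bits are XORed with a pad derived from that frame's key, and the resulting bits are written into the signs of Gaussian samples. The function names (`keyed_pad`, `embed`, `extract`), the per-frame message layout, and the key values are all hypothetical. The point it illustrates is that extraction needs only the fixed global key set, not a stored table of message-key pairs.

```python
import random
from typing import List

def keyed_pad(key: int, n: int) -> List[int]:
    """Derive n pseudorandom bits from a key (toy stand-in for a PRC codeword mask)."""
    rng = random.Random(key)
    return [rng.getrandbits(1) for _ in range(n)]

def embed(messages: List[List[int]], keys: List[int], seed: int) -> List[List[float]]:
    """Write one message chunk per latent frame into the signs of Gaussian noise.

    Assumption: one key per latent frame, shared globally across all videos.
    """
    rng = random.Random(seed)
    noise = []
    for message, key in zip(messages, keys):
        pad = keyed_pad(key, len(message))
        frame = []
        for m, p in zip(message, pad):
            magnitude = abs(rng.gauss(0.0, 1.0))  # keeps a Gaussian magnitude
            frame.append(-magnitude if (m ^ p) else magnitude)
        noise.append(frame)
    return noise

def extract(noise: List[List[float]], keys: List[int]) -> List[List[int]]:
    """Blind extraction: read the signs and undo the pad using only the global keys."""
    messages = []
    for frame, key in zip(noise, keys):
        pad = keyed_pad(key, len(frame))
        messages.append([(1 if z < 0 else 0) ^ p for z, p in zip(frame, pad)])
    return messages

global_keys = [11, 22, 33]  # hypothetical global frame-wise keys
messages = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
watermarked_noise = embed(messages, global_keys, seed=0)
recovered = extract(watermarked_noise, global_keys)
```

In this toy setting the extractor never looks up which message was embedded in which video, whereas a non-blind scheme would have to compare the recovered noise against every stored template. In a real pipeline the noise would first have to be recovered by inverting the diffusion process, which this sketch omits.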

Paper Structure (23 sections, 7 equations, 4 figures, 3 tables)

Introduction
Related works
Diffusion models
Video watermarking
In-generation watermarking for diffusion models
Method
Problem formulation
Framework overview
Watermark embedding
Video generation by modern diffusion models
Global Frame-wise PseudoRandom Coding scheme
Watermark extraction
Segment Group-Ordering module
Message bits extraction
Experiments
...and 8 more sections

Figures (4)

Figure 1: (a) Post-processing watermarks: embedding watermarks in pixel space, which inevitably degrades video quality. (b) Existing in-generation methods: maintaining all the message-key pairs for matching, incurring high extraction costs and poor robustness. (c) Our proposed SIGMark: a blind watermarking framework with global frame-wise PRC keys, which is inherently scalable.
Figure 2: Overview of our proposed SIGMark. Embedding: We encode the watermark message into the initial latent noise using a Global set of Frame-wise Pseudo-Random Coding (GF-PRC) keys. The diffusion model then denoises this noise into video frames that carry the embedded messages. Extraction: A (possibly disturbed) video is first processed by our proposed Segment Group-Ordering (SGO) module to recover the correct causal frame grouping, then inverted to obtain the latent noise, from which the message is decoded using the GF-PRC keys. The system stores only the GF-PRC keys for both embedding and extraction, enabling blind watermarking.
Figure 3: Segment Group-Ordering (SGO) module. We set the compression ratio $d_t=f/f_l=4$ as an example. When temporal disturbances (e.g., clipping or frame drops) occur, the causal grouping is disrupted; without re-ordering, this leads to incorrectly encoded latent features. Our SGO module restores the correct grouping and ordering, yielding robust latent features for video inversion.
Figure 4: The decoding time cost during watermark extraction.
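The grouping problem described in the Figure 3 caption can be illustrated with a small sketch. It assumes the simplest possible layout, in which every $d_t=4$ consecutive frames map to one latent frame, and it assumes the original frame indices are known after the disturbance. Recovering those indices from a disturbed video is what the SGO module is for, and that step is not shown here. The helper names `naive_groups` and `aligned_groups` are hypothetical.

```python
from typing import Dict, List

def naive_groups(frame_ids: List[int], d_t: int) -> List[List[int]]:
    """Group frames by their position in the received video, ignoring the original timeline."""
    return [frame_ids[i:i + d_t] for i in range(0, len(frame_ids), d_t)]

def aligned_groups(frame_ids: List[int], d_t: int) -> Dict[int, List[int]]:
    """Group frames by the latent frame they originally belonged to.

    Assumes the original index of every surviving frame is known.
    """
    groups: Dict[int, List[int]] = {}
    for fid in frame_ids:
        groups.setdefault(fid // d_t, []).append(fid)
    return groups

d_t = 4
original = list(range(12))   # 12 frames, i.e. 3 latent frames
clipped = original[2:]       # temporal disturbance: the first two frames are cut

naive = naive_groups(clipped, d_t)
# [[2, 3, 4, 5], [6, 7, 8, 9], [10, 11]]: every full group mixes two original groups
aligned = aligned_groups(clipped, d_t)
# {0: [2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}: original groups are preserved
```

With naive grouping, each group straddles a boundary of the original grouping, so the encoder would produce latent features that do not correspond to any latent frame used at generation time. With aligned grouping, the second and third groups match the original ones exactly and only the first is incomplete.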
