LVMark: Robust Watermark for Latent Video Diffusion Models

MinHyuk Jang, Youngdong Jang, JaeHyeok Lee, Feng Yang, Gyeongrok Oh, Jongheon Jeong, Sangpil Kim

TL;DR

LVMark tackles ownership protection for video diffusion models by embedding imperceptible watermarks that remain robust under video distortions and model attacks. It fuses low-frequency information from a 3D discrete wavelet transform with RGB video features using cross-attention, and embeds watermarks by selectively modulating a subset of latent-decoder weights. A distortion layer and a composite training loss balance visual quality and bit accuracy, achieving up to 512-bit capacity with robust decoding. Empirically, LVMark outperforms existing approaches in temporal consistency and robustness, enabling reliable ownership tracking without compromising video fidelity.

Abstract

Rapid advancements in video diffusion models have enabled the creation of realistic videos, raising concerns about unauthorized use and driving the demand for techniques to protect model ownership. Existing watermarking methods, while effective for image diffusion models, do not account for temporal consistency, leading to degraded video quality and reduced robustness against video distortions. To address this issue, we introduce LVMark, a novel watermarking method for video diffusion models. We propose a new watermark decoder tailored for generated videos that learns the consistency between adjacent frames. It ensures accurate message decoding, even under malicious attacks, by combining the low-frequency components of the 3D wavelet domain with the RGB features of the video. Additionally, our approach minimizes video quality degradation by embedding watermark messages in layers with minimal impact on visual appearance using an importance-based weight modulation strategy. We optimize both the watermark decoder and the latent decoder of the diffusion model, effectively balancing the trade-off between visual quality and bit accuracy. Our experiments show that our method embeds invisible watermarks into video diffusion models, ensuring robust decoding accuracy with 512-bit capacity, even under video distortions.
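To make two of the terms in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a one-level Haar wavelet (the abstract does not name the wavelet family), a video stored as a `(T, H, W, C)` array with even `T`, `H`, and `W`, and a decoder that outputs one real-valued logit per message bit. The function names `haar_lowpass`, `lll_subband`, and `bit_accuracy` are hypothetical.

```python
import numpy as np


def haar_lowpass(x, axis):
    """One-level orthonormal Haar low-pass along `axis` (length must be even)."""
    n = x.shape[axis]
    if n % 2 != 0:
        raise ValueError("axis length must be even")
    even = np.take(x, np.arange(0, n, 2), axis=axis)
    odd = np.take(x, np.arange(1, n, 2), axis=axis)
    return (even + odd) / np.sqrt(2.0)


def lll_subband(video):
    """Low-frequency subband of a 3D Haar transform.

    Assumption: `video` is a float array of shape (T, H, W, C); the low-pass
    filter is applied along time, height, and width, halving each of them.
    """
    out = np.asarray(video, dtype=np.float64)
    for axis in (0, 1, 2):
        out = haar_lowpass(out, axis)
    return out


def bit_accuracy(logits, message):
    """Fraction of message bits recovered after thresholding logits at zero."""
    bits = (np.asarray(logits) > 0).astype(np.int64)
    return float(np.mean(bits == np.asarray(message)))


video = np.ones((4, 8, 8, 3))
low = lll_subband(video)  # shape (2, 4, 4, 3)
acc = bit_accuracy([1.0, -1.0, 2.0, -3.0], [1, 0, 0, 0])  # 3 of 4 bits match
```

Because the low-pass filter averages neighbouring samples along time as well as space, this subband changes slowly across adjacent frames, which is one plausible reason a decoder would pair it with the RGB features when temporal consistency matters.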

Paper Structure

This paper contains 17 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of video generation and ownership identification. Top: An authorized user generates a watermarked video. Bottom: Despite distortions applied to the video or model, the watermark decoder can reliably identify the video's owner.
  • Figure 2: Training pipeline. Top: We fine-tune the latent decoder to embed binary messages in generated videos and train the watermark decoder to retrieve messages from distorted videos. Bottom-left: We modulate layers of the latent decoder that minimally impact visual quality to embed random messages. Bottom-right: The watermark decoder combines the RGB video with low-frequency subbands from a 3D wavelet transform using cross-attention to decode the binary message.
  • Figure 3: Qualitative results with baselines. We show the visual quality of generated videos with baseline watermarking methods. The first row shows the crop of each image, while the second and third rows show the video frame itself and the difference map ($\times$10) between the original and each method. Note that, unlike other approaches, VideoShield [hu2025videoshield] embeds watermarks in the latent noise of the diffusion model, generating a video different from the original.
  • Figure 4: The impact of weight modulation rate. Each point represents the metrics for different modulation rates: 0%, 25%, 50%, 75%, and 100%.
  • Figure 5: Visual impact of weighted patch loss. We visualize a video frame trained with and without weighted patch loss. To enhance clarity, we crop the local regions containing artifacts. Both results used VGG loss [czolbe2020loss] with the default setting.
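
The modulation rate in Figure 4 and the layer selection in Figure 2 can be illustrated with a small sketch. This is an assumption-laden illustration rather than the paper's procedure: it supposes each latent-decoder layer already has a scalar importance score (how such scores are computed is not described in this section), and it picks the least important fraction of layers for modulation. The function `select_layers_to_modulate` and the layer names are hypothetical.

```python
def select_layers_to_modulate(importance, rate):
    """Return the names of the least important layers.

    Assumptions: `importance` maps layer name -> score, where a lower score
    means less impact on visual quality; `rate` is the fraction of layers to
    modulate, in [0, 1].
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    count = int(round(rate * len(importance)))
    ranked = sorted(importance, key=importance.get)  # ascending importance
    return ranked[:count]


# Hypothetical scores for four decoder layers.
scores = {"conv_a": 0.9, "conv_b": 0.1, "conv_c": 0.5, "conv_d": 0.3}
half = select_layers_to_modulate(scores, 0.5)
none = select_layers_to_modulate(scores, 0.0)
everything = select_layers_to_modulate(scores, 1.0)
```

Under these assumptions, a rate of 0% leaves every layer untouched and a rate of 100% modulates all of them, matching the two endpoints plotted in Figure 4.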