Investigating Memorization in Video Diffusion Models

Chen Chen, Enhuai Liu, Daochang Liu, Mubarak Shah, Chang Xu

TL;DR

This work addresses the privacy risk of memorization in video diffusion models (VDMs) by giving separate definitions of content memorization and motion memorization and introducing a dedicated metric for each. Generalized SSCD (GSSCD) measures frame-level content memorization, and Optical Flow Similarity (OFS-k) measures motion memorization, with Natural Motion Filtering (NMF) used to discount natural motions. A systematic, prompt-driven analysis across open-source VDMs on WebVid-10M reveals widespread memorization of training data, including image training data in addition to video data, which highlights the privacy risks of openly available models. The authors propose inference-time detection strategies that leverage content and motion magnitudes to efficiently flag memorized outputs, offering a practical foundation for privacy-preserving VDM deployment.

Abstract

Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this gap, we first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way that focuses on privacy preservation and applies to all generation types. We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. Additionally, we curate a dataset of text prompts that are most prone to triggering memorization when used as conditioning in VDMs. By leveraging these prompts, we generate diverse videos from various open-source VDMs, successfully extracting numerous training videos from each tested model. Through the application of our proposed metrics, we systematically analyze memorization across various pretrained VDMs, including text-conditional and unconditional models, on a variety of datasets. Our comprehensive study reveals that memorization is widespread across all tested VDMs, indicating that VDMs can also memorize image training data in addition to video datasets. Finally, we propose efficient and effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs.

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Video training dataset (WebVid-10M) being extracted by several open-source text-conditional VDMs. The generated videos can reproduce both the content and the motion of the training videos. For each example, the top shows the training video and the bottom shows the generated video.
  • Figure 2: Motion memorization detected using OFS-3. The left side shows consecutive frame pairs from both the training video and the generated video using ModelScope, while the right side visualizes the optical flow between consecutive frames, computed using RAFT. OFS-3 calculates the average cosine similarity across three frame pairs, effectively capturing motion memorization. The OFS-3 scores for all four examples are above 0.5: 0.9012, 0.9393, 0.8523, and 0.8806, thus classified as motion-memorized cases.
  • Figure 3: Image training dataset (LAION) being extracted by ModelScopeT2V. For each qualitative example, the left shows images from the training set, while the right displays the most similar frames from the generated videos according to GSSCD.
  • Figure 4: Video training dataset (UCF-101) being extracted by RaMViD's unconditional generation.
  • Figure 5: Video training dataset (WebVid-10M) being extracted by ModelScope.
  • ...and 4 more figures
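
The Figure 2 caption describes OFS-3 as the average cosine similarity of optical flow across three consecutive-frame pairs, with scores above 0.5 classified as motion-memorized. The following Python sketch illustrates that description only and is not the authors' implementation. It assumes the optical flow fields have already been computed (for example with RAFT) and are passed in as arrays of shape (H, W, 2). The function names `flow_cosine`, `ofs_k`, and `is_motion_memorized` are hypothetical, comparing each flow field as one flattened vector is an assumption (the caption does not say whether similarity is taken per pixel or per field), and Natural Motion Filtering is not modeled here.

```python
import numpy as np


def flow_cosine(flow_a, flow_b, eps=1e-8):
    """Cosine similarity between two flow fields of shape (H, W, 2).

    Assumption: each field is flattened into a single vector before comparison.
    """
    a = np.asarray(flow_a, dtype=np.float64).ravel()
    b = np.asarray(flow_b, dtype=np.float64).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / (denom + eps))  # eps guards against all-zero flow


def ofs_k(train_flows, gen_flows, k=3):
    """Average flow cosine similarity over the first k frame pairs."""
    if len(train_flows) < k or len(gen_flows) < k:
        raise ValueError("need at least k flow fields from each video")
    sims = [flow_cosine(train_flows[i], gen_flows[i]) for i in range(k)]
    return float(np.mean(sims))


def is_motion_memorized(score, threshold=0.5):
    """Apply the 0.5 threshold mentioned in the Figure 2 caption."""
    return score > threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for RAFT outputs: three flow fields per video.
    train = [rng.normal(size=(8, 8, 2)) for _ in range(3)]
    generated = [f + 0.1 * rng.normal(size=f.shape) for f in train]
    score = ofs_k(train, generated, k=3)
    print(score, is_motion_memorized(score))
```

Identical flow sequences score close to 1 and reversed motion scores close to -1, so a fixed threshold separates near-copies of a training motion from unrelated motion under these assumptions.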