Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification

Mischa Dombrowski, Hadrien Reynaud, Bernhard Kainz

TL;DR

The paper addresses privacy and utility concerns in latent video diffusion models by training privacy filters in the latent space of the generator's VAE and by evaluating subspace faithfulness with a re-identification-based privacy measure. It shows that latent-space privacy models are computationally more efficient and generalize better than their image-space counterparts, and that the same models can be used to assess temporal consistency and model recall. A key finding is that latent video diffusion models learn only a subset of the training videos (at most 30.8%), which may help explain why downstream models trained on synthetic data lag behind those trained on real data. The work provides a practical privacy-preserving diagnostic framework for synthetic medical videos, with implications for safer data sharing and improved evaluation of generative model faithfulness.

Abstract

Latent Video Diffusion Models can easily deceive casual observers and domain experts alike thanks to the produced image quality and temporal consistency. Beyond entertainment, this creates opportunities around safe data sharing of fully synthetic datasets, which are crucial in healthcare, as well as other domains relying on sensitive personal information. However, privacy concerns with this approach have not yet been fully addressed, and models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly due to the sampling space being a subspace of the training videos, effectively reducing the training data size for downstream models. Additionally, the reduced temporal consistency when generating long videos could be a contributing factor. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalizes better. Furthermore, to investigate downstream degradation factors, we propose to use a re-identification model, previously employed as a privacy preservation filter. We demonstrate that it is sufficient to train this model on the latent space of the video generator. Subsequently, we use these models to evaluate the subspace covered by synthetic video datasets and thus introduce a new way to measure the faithfulness of generative machine learning models. We focus on a specific application in healthcare echocardiography to illustrate the effectiveness of our novel methods. Our findings indicate that only up to 30.8% of the training videos are learned in latent video diffusion models, which could explain the lack of performance when training downstream tasks on synthetic data.

Paper Structure

This paper contains 11 sections, 3 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: We propose leveraging Variational Autoencoders (VAEs), which were trained to enable the training of video diffusion models, to enhance the computational efficiency of models used for privacy filtering. Additionally, we demonstrate how the trained filter model can be applied to various other tasks, such as evaluating temporal consistency and model recall.
  • Figure 2: Overview of our approach: We take the LIDM and LVDM from reynaud2024echonet (left) for video generation. Our latent privacy model is based on packhauser2022deep. We show that privacy regularization methods are more reliable in latent space than in image space. Then we use the privacy model to evaluate temporal consistency and generative model recall, which is a measure of how many of the training images are learned by the model without raising privacy issues.
  • Figure 3: Confusion matrices for the different datasets computed in image and latent space.
  • Figure 4: $\mathcal{P}_{\text{max}}$ values of the distances between the training dataset and real test samples and between the training dataset and the synthetic videos. The black line corresponds to the 95th percentile privacy threshold.
  • Figure 5: t-SNE plot of training and synthetic image datasets. We visualize the t-SNE components of the learned representations extracted by the privacy filter model. Synthetic samples are shown in orange and training images are shown in light blue. For every synthetic image, we apply privacy filtering. If a training image is closer to a synthetic image than the synthetic image is to all other training images, we consider the training image learned and change the color of the training image in the plot to dark blue.
  • ...and 1 more figure
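
The "learned" rule in the Figure 5 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the privacy filter model yields one embedding vector per sample, that closeness is measured with Euclidean distance, and that privacy filtering can be approximated by discarding synthetic samples that lie closer than a threshold to their nearest training sample. The names `pairwise_distances`, `learned_fraction`, and `min_distance` are hypothetical and do not come from the paper.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def learned_fraction(train_emb, synth_emb, min_distance=0.0):
    """Fraction of training samples that are the nearest training
    neighbour of at least one retained synthetic sample.

    min_distance is an assumed stand-in for the privacy filter:
    synthetic samples closer than this to their nearest training
    sample are discarded before counting.
    """
    dists = pairwise_distances(synth_emb, train_emb)  # (n_synth, n_train)
    nearest = dists.argmin(axis=1)        # index of closest training sample
    nearest_dist = dists.min(axis=1)      # distance to that sample
    keep = nearest_dist >= min_distance   # assumed privacy filtering step

    learned = np.zeros(len(train_emb), dtype=bool)
    learned[nearest[keep]] = True         # mark matched training samples
    return float(learned.mean()), learned


# Toy embeddings: four training samples, three synthetic samples.
train = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
synth = np.array([[0.5, 0.0], [9.0, 0.5], [0.2, 0.1]])

fraction, mask = learned_fraction(train, synth)
print(fraction, mask)
```

In this toy setup the synthetic samples cluster around two of the four training samples, so the learned fraction is 0.5. Raising `min_distance` removes near-copies first, which mirrors the idea of counting only training samples that are learned without raising privacy issues.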