Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification
Mischa Dombrowski, Hadrien Reynaud, Bernhard Kainz
TL;DR
The paper addresses privacy and usefulness concerns in latent video diffusion models by training privacy filters in latent space using a VAE-based representation and evaluating subspace faithfulness with a re-identification-based privacy measure. It demonstrates that latent-space privacy models are more efficient and generalize better than their image-space counterparts, while enabling robust assessments of temporal consistency and memory leakage. A key finding is that latent diffusion models may learn only a subset of the training videos (at most 30.8%), which helps explain why downstream models trained on synthetic data lag behind those trained on real data. The work provides a practical privacy-preserving diagnostic framework for synthetic medical videos, with implications for safer data sharing and improved evaluation of generative model faithfulness.
Abstract
Latent Video Diffusion Models can easily deceive casual observers and domain experts alike thanks to the image quality and temporal consistency of their outputs. Beyond entertainment, this creates opportunities for safe data sharing of fully synthetic datasets, which are crucial in healthcare as well as in other domains relying on sensitive personal information. However, the privacy concerns of this approach have not yet been fully addressed, and models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly due to the sampling space being a subspace of the training videos, effectively reducing the training data size for downstream models. Additionally, the reduced temporal consistency when generating long videos could be a contributing factor. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and that these models generalize better. Furthermore, to investigate downstream degradation factors, we propose to use a re-identification model, previously employed as a privacy preservation filter. We demonstrate that it is sufficient to train this model on the latent space of the video generator. Subsequently, we use these models to evaluate the subspace covered by synthetic video datasets and thus introduce a new way to measure the faithfulness of generative machine learning models. We focus on a specific healthcare application, echocardiography, to illustrate the effectiveness of our novel methods. Our findings indicate that at most 30.8% of the training videos are learned by latent video diffusion models, which could explain the performance gap when training downstream tasks on synthetic data.
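To make the idea of "the subspace covered by synthetic video datasets" concrete, the following is a minimal sketch of one way such a coverage fraction could be computed. It is not the authors' implementation: the abstract describes a trained re-identification model, whereas this sketch assumes that each video has already been mapped to an embedding vector and uses cosine similarity with a hypothetical `threshold` as a stand-in for the re-identification decision. The function name `covered_fraction` and all parameter names are illustrative.

```python
import numpy as np


def covered_fraction(train_emb, synth_emb, threshold):
    """Fraction of training items re-identified by at least one synthetic sample.

    Assumption: cosine similarity >= threshold stands in for the decision of a
    trained re-identification model (hypothetical simplification).
    """
    # L2-normalise rows so that the dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    synth = synth_emb / np.linalg.norm(synth_emb, axis=1, keepdims=True)

    # Similarity of every synthetic sample to every training item,
    # shape (n_synth, n_train).
    sim = synth @ train.T

    # For each synthetic sample, find its closest training item and check
    # whether the match is strong enough to count as a re-identification.
    nearest = sim.argmax(axis=1)
    matched = sim.max(axis=1) >= threshold

    # A training item is "covered" if any synthetic sample matched it.
    covered = np.unique(nearest[matched])
    return covered.size / train.shape[0]


# Toy example: four training embeddings, three synthetic embeddings.
train_emb = np.eye(4)
synth_emb = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
print(covered_fraction(train_emb, synth_emb, threshold=0.9))
```

Under this simplification, a low coverage fraction would indicate that the synthetic samples concentrate around a small part of the training set, which is the kind of effect the abstract proposes as an explanation for weaker downstream performance.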
