Privacy-preserving datasets by capturing feature distributions with Conditional VAEs

Francesco Di Salvo, David Tafler, Sebastian Doerrich, Christian Ledig

TL;DR

The paper addresses privacy-sensitive data sharing by replacing raw data with synthetic features generated by a Conditional Variational Autoencoder (CVAE) trained on embeddings from a large foundation model. It shifts generation from the pixel space to the embedding space, enabling a frozen CVAE decoder to sample diverse, privacy-preserving feature vectors, with offline and online anonymization strategies. Across medical and natural image datasets, the CVAE approach outperforms $k$-Same in feature diversity and robustness while preserving downstream utility measured by AUC. The work demonstrates a practical route to privacy-preserving, data-efficient learning in settings with data scarcity and privacy constraints, with potential reductions in data exchange and improved resilience to perturbations.

Abstract

Large and well-annotated datasets are essential for advancing deep learning applications, but they are often costly or impossible for a single entity to obtain. In many areas, including the medical domain, approaches relying on data sharing have become critical to address those challenges. While effective in increasing dataset size and diversity, data sharing raises significant privacy concerns. Commonly employed anonymization methods based on the k-anonymity paradigm often fail to preserve data diversity, affecting model robustness. This work introduces a novel approach using Conditional Variational Autoencoders (CVAEs) trained on feature vectors extracted from large pre-trained vision foundation models. Foundation models effectively detect and represent complex patterns across diverse domains, allowing the CVAE to faithfully capture the embedding space of a given data distribution to generate (sample) a diverse, privacy-respecting, and potentially unbounded set of synthetic feature vectors. Our method notably outperforms traditional approaches in both medical and natural image domains, exhibiting greater dataset diversity and higher robustness against perturbations while preserving sample privacy. These results underscore the potential of generative models to significantly impact deep learning applications in data-scarce and privacy-sensitive environments. The source code is available at https://github.com/francescodisalvo05/cvae-anonymization

Paper Structure

This paper contains 13 sections, 1 equation, 3 figures, 1 table, and 1 algorithm.

Figures (3)

  • Figure 1: Illustration of our proposed anonymization approach. Given an image dataset $(x_i,y_i) \in \mathcal{X}$ with categorical class distribution $\mathcal{C}$, we first utilize a large pre-trained model to extract and store feature embeddings and corresponding labels $(f_i,y_i) \in \mathcal{F}$. These embeddings capture both local and contextual information while inherently reducing dimensionality. Subsequently, the embeddings are used to train a Conditional Variational Autoencoder (CVAE) that captures the training distribution conditioned on the respective class labels $y_i$. Finally, we train a task-specific head while dynamically generating new synthetic feature vectors $a_j$ conditioned on class labels $\tilde{y}_j \sim \mathcal{C}$ through the CVAE's frozen decoder. This not only ensures data anonymity but also increases data diversity and model robustness.
  • Figure 2: Classification performance (AUC) and average nearest neighbor distance ($\mathcal{D}$) on medical (top row) and non-medical (bottom row) datasets. We report in brackets the number of (training samples, classes). Our objective is to maximize both metrics (top-right corner). The vertical line represents the baseline performance achieved without anonymization.
  • Figure 3: Class distribution of the BreastMNIST dataset and its anonymous counterparts through $k$-Same $(5,10)$ and CVAE. Clearly, while CVAE faithfully preserves data diversity, $k$-Same tends to agglomerate information, increasing data sparsity and losing precious information, especially on limited-size datasets.
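
The sampling step described in Figure 1 and the distance metric $\mathcal{D}$ from Figure 2 can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: `FrozenDecoder` is a hypothetical stand-in whose weights are random rather than loaded from a trained CVAE, the dimensions and class probabilities are made up, and `avg_nn_distance` assumes that $\mathcal{D}$ is the mean Euclidean distance from each synthetic vector to its nearest real feature vector (the exact definition is given in the paper, not in this summary).

```python
import numpy as np


class FrozenDecoder:
    """Hypothetical stand-in for a trained, frozen CVAE decoder.

    Maps the concatenation [z, one_hot(y)] to a feature vector. The weights
    are random here purely for illustration; in practice they would come
    from a CVAE trained on the extracted embeddings (f_i, y_i).
    """

    def __init__(self, latent_dim, num_classes, feat_dim, rng):
        self.W = rng.normal(size=(latent_dim + num_classes, feat_dim))
        self.b = rng.normal(size=feat_dim)

    def __call__(self, z, y_onehot):
        return np.tanh(np.concatenate([z, y_onehot], axis=1) @ self.W + self.b)


def generate_synthetic(decoder, class_probs, n, latent_dim, rng):
    """Sample labels y~_j ~ C and latents z ~ N(0, I), then decode to a_j."""
    num_classes = len(class_probs)
    labels = rng.choice(num_classes, size=n, p=class_probs)
    z = rng.standard_normal((n, latent_dim))
    y_onehot = np.eye(num_classes)[labels]
    return decoder(z, y_onehot), labels


def avg_nn_distance(synthetic, real):
    """Mean Euclidean distance from each synthetic vector to its nearest real one.

    Assumed reading of the metric D; larger values mean synthetic samples
    sit farther from any individual real sample.
    """
    pairwise = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return pairwise.min(axis=1).mean()


# Illustrative usage with made-up sizes and class distribution.
rng = np.random.default_rng(0)
class_probs = np.array([0.3, 0.7])       # hypothetical class distribution C
latent_dim, feat_dim = 4, 16             # hypothetical dimensions
decoder = FrozenDecoder(latent_dim, len(class_probs), feat_dim, rng)

real_features = rng.normal(size=(32, feat_dim))   # placeholder for stored f_i
synthetic, labels = generate_synthetic(decoder, class_probs, 8, latent_dim, rng)
distance = avg_nn_distance(synthetic, real_features)
```

In the online setting described in Figure 1, a call like `generate_synthetic` would run once per training batch of the task-specific head, so the head never sees the stored real features; in an offline setting, the same call would be made once up front to build a fixed synthetic dataset.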