Differentially-Private Data Synthetisation for Efficient Re-Identification Risk Control

Tânia Carvalho, Nuno Moniz, Luís Antunes, Nitesh Chawla

TL;DR

The paper addresses re-identification risk in sharing tabular data by proposing $ε$-PrivateSMOTE, a privacy-preserving synthesis method that fuses noise-induced interpolation with $ε$-differential privacy via the Laplace mechanism. It targets high-risk records by selectively substituting them with synthetic neighbors, achieving competitive privacy risk while maintaining predictive performance relative to GAN/VAE and traditional DP baselines, and reporting at least a 9x speedup in runtime on CPU. Empirical evaluation across 15 OpenML datasets shows that larger $ε$ improves utility and reduces linkability, albeit with some trade-offs at very low $ε$ values; the method also generates more variants and is markedly more resource-efficient than deep-learning and DP-based approaches. The work demonstrates a practical, scalable approach for privacy-preserving data sharing in tabular contexts, while noting limitations such as potential inference attacks, outlier obfuscation, and scope restricted to tabular data, pointing to future work on broader data modalities and robustness.
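The selection of high-risk records mentioned above can be illustrated with a small sketch. It assumes a k-anonymity-style criterion: a record counts as high-risk when fewer than `k` records share its combination of quasi-identifier values. The function name `select_high_risk`, the threshold `k`, and the toy columns are hypothetical and are not taken from the paper.

```python
from collections import Counter


def select_high_risk(records, quasi_identifiers, k=3):
    """Return indices of records whose quasi-identifier combination
    is shared by fewer than k records (assumed risk criterion)."""
    # Build one key per record from the chosen quasi-identifier columns.
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    # Count how many records fall into each equivalence class.
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


# Toy data: the last record is unique on (age, zip).
records = [
    {"age": 34, "zip": "4000", "income": 21},
    {"age": 34, "zip": "4000", "income": 35},
    {"age": 34, "zip": "4000", "income": 28},
    {"age": 71, "zip": "4150", "income": 90},
]
high_risk = select_high_risk(records, ["age", "zip"], k=3)
```

Only the records returned by such a selection step would then be replaced with synthetic cases, which leaves the low-risk part of the data untouched.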

Abstract

Protecting user data privacy can be achieved via many methods, from statistical transformations to generative models. However, all of them have critical drawbacks. For example, creating a transformed data set using traditional techniques is highly time-consuming. Also, recent deep learning-based solutions require significant computational resources in addition to long training phases, and differential privacy-based solutions may undermine data utility. In this paper, we propose $ε$-PrivateSMOTE, a technique designed for safeguarding against re-identification and linkage attacks, particularly addressing cases with a high re-identification risk. Our proposal combines synthetic data generation via noise-induced interpolation with differential privacy principles to obfuscate high-risk cases. We demonstrate how $ε$-PrivateSMOTE is capable of achieving competitive results in privacy risk and better predictive performance when compared to multiple traditional and state-of-the-art privacy-preservation methods, including generative adversarial networks, variational autoencoders, and differential privacy baselines. We also show how our method improves time requirements by at least a factor of 9 and is a resource-efficient solution that ensures high performance without specialised hardware.
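A rough way to picture noise-induced interpolation is a SMOTE-style step in which the interpolation factor is drawn from a Laplace distribution with scale `sensitivity / epsilon`, rather than from a uniform distribution. The sketch below is one possible reading of that idea for numeric features only. The function names, the default `sensitivity=1.0`, and the neighbour count are assumptions, and the code makes no claim to reproduce the authors' algorithm or to provide a formal privacy guarantee.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def nearest_neighbours(target, pool, n_neighbours):
    # Euclidean nearest neighbours; `pool` should not contain `target`.
    return sorted(pool, key=lambda p: math.dist(target, p))[:n_neighbours]


def synthesise(target, pool, epsilon, n_neighbours=2, sensitivity=1.0, rng=None):
    """Create one synthetic record from a high-risk `target` record.

    Assumed update rule: target + (neighbour - target) * Laplace noise.
    """
    if rng is None:
        rng = random.Random()
    neighbour = rng.choice(nearest_neighbours(target, pool, n_neighbours))
    scale = sensitivity / epsilon  # smaller epsilon means more noise
    return [t + (n - t) * laplace_noise(scale, rng)
            for t, n in zip(target, neighbour)]


# Hypothetical numeric records; the target is the high-risk case.
target = [1.0, 5.0]
pool = [[2.0, 5.0], [3.0, 5.0], [40.0, 9.0]]
synthetic = synthesise(target, pool, epsilon=1.0, rng=random.Random(0))
```

In this sketch a feature on which the target and its neighbour agree is left unchanged, because the difference being scaled is zero. A fuller implementation would need separate handling for categorical attributes.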

Paper Structure (14 sections, 2 equations, 6 figures, 2 tables, 1 algorithm)

This paper contains 14 sections, 2 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Methodology of the experimental evaluation.
  • Figure 2: Relationship between privacy risk and predictive performance of all data variants produced with $\epsilon$-PrivateSMOTE for each $\epsilon$.
  • Figure 3: Data utility measures for each $\epsilon$ across all data variants using $\epsilon$-PrivateSMOTE.
  • Figure 4: Best predictive performance results and corresponding privacy risk (blue) and best privacy risk results and corresponding predictive performance results for each transformation technique (red).
  • Figure 5: Comparison between the best-estimated hyperparameter configuration per transformation technique and the oracle configuration. Illustrates the proportion of probability for each candidate solution drawing or losing significantly against the oracle according to the Bayes Sign Test for predictive performance.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Definition 1: Re-identification
  • Definition 2: Highest-risk selection
  • Definition 3: Laplace mechanism
  • Definition 4: Linkability (Giomi et al., 2022)
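
For orientation, the Laplace mechanism named in Definition 3 is usually stated as follows in the differential privacy literature. This is the standard textbook form; the paper's own wording of the definition is not reproduced in this outline, and the symbols $f$, $D$, $D'$ and $\Delta f$ are the conventional ones rather than notation confirmed by the source.

$$
\mathcal{M}(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right),
\qquad
\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1
$$

Here $D$ and $D'$ range over data sets that differ in a single record, and $\mathrm{Lap}(b)$ denotes a zero-mean Laplace distribution with density $p(x) = \frac{1}{2b}\exp\!\left(-\frac{|x|}{b}\right)$. A smaller $\epsilon$ gives a larger scale $b = \Delta f / \epsilon$ and therefore more noise.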