Training Set Reconstruction from Differentially Private Forests: How Effective is DP?

Alice Gorgé, Julien Ferry, Sébastien Gambs, Thibaut Vidal

TL;DR

The paper investigates whether training data in differentially private random forests can be reconstructed. It introduces a constraint-programming–based reconstruction attack that leverages the forest structure and DP noise to recover a most likely training dataset, and it evaluates this approach on three tabular datasets across multiple DP budgets and forest configurations. The findings reveal that, although DP attenuates leakage, meaningful reconstruction of training data remains possible for non-trivial $\varepsilon$ values, with leakage often extending to dataset-specific details rather than mere distributional patterns; only extremely sparse, trivially predictive forests appear robust. The work highlights the need for careful DP mechanism design and hyperparameter tuning to balance privacy and utility, and it offers practical recommendations and a mathematical framework for privacy assessment of DP RFs in real-world deployments.

Abstract

Recent research has shown that structured machine learning models such as tree ensembles are vulnerable to privacy attacks targeting their training data. To mitigate these risks, differential privacy (DP) has become a widely adopted countermeasure, as it offers rigorous privacy protection. In this paper, we introduce a reconstruction attack targeting state-of-the-art $\varepsilon$-DP random forests. By leveraging a constraint programming model that incorporates knowledge of the forest's structure and DP mechanism characteristics, our approach formally reconstructs the most likely dataset that could have produced a given forest. Through extensive computational experiments, we examine the interplay between model utility, privacy guarantees and reconstruction accuracy across various configurations. Our results reveal that random forests trained with meaningful DP guarantees can still leak portions of their training data. Specifically, while DP reduces the success of reconstruction attacks, the only forests fully robust to our attack exhibit predictive performance no better than a constant classifier. Building on these insights, we also provide practical recommendations for the construction of DP random forests that are more resilient to reconstruction attacks while maintaining a non-trivial predictive performance.

Paper Structure

This paper contains 24 sections, 23 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Example decision tree $t$, before and after adding Laplace noise to comply with 1-DP. The rightmost leaf, originally empty, now reports a positive sample count.
  • Figure 2: Average reconstruction error as a function of the privacy budget $\varepsilon$ used to fit the target DP RF, for different numbers $\lvert \forest \rvert$ of depth-5 trees on the COMPAS dataset. For comparison, the dashed line reports the reconstruction error of DRAFT applied to the same RFs without DP protection (the x-axis does not apply to this baseline).
  • Figure 3: Average training accuracy of $\varepsilon$-DP RFs with depth-7 trees as a function of the privacy budget $\varepsilon$, for different forest sizes $\lvert \forest \rvert$ on the COMPAS dataset.
  • Figure 4: Average training accuracy of $\varepsilon$-DP RFs with depth-5 trees as a function of the reconstruction error, for different privacy budgets $\varepsilon$ and forest sizes $\lvert \forest \rvert$ on the COMPAS dataset.
  • Figure 5: Width of the search interval for $\Delta_{\tree\node\class}$ as a function of $\varepsilon_v$. As expected, the magnitude of the added noise decreases as the privacy budget increases, resulting in a narrower search interval.
  • ...and 12 more figures
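
The captions of Figures 1 and 5 describe two effects of Laplace noise: an empty leaf can end up reporting a positive count, and the noise shrinks as the privacy budget grows. The sketch below illustrates both under stated assumptions and is not the paper's implementation. It assumes each leaf count has sensitivity 1, so the noise scale is 1/ε. It also assumes noisy counts are rounded and clipped at zero, and it uses a hypothetical confidence level `delta` to define an interval width. The function names and the example counts are illustrative only.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_leaf_counts(counts, epsilon, rng):
    """Add Laplace(1/epsilon) noise to each count, then round and clip at zero."""
    scale = 1.0 / epsilon  # assumption: each count has sensitivity 1
    # Rounding and clipping are post-processing steps assumed for illustration.
    return [max(0, round(c + laplace_noise(scale, rng))) for c in counts]


def noise_interval_width(epsilon, delta=0.01):
    """Width of the symmetric interval that holds the noise with probability 1 - delta.

    For Laplace noise with scale b, P(|X| > t) = exp(-t / b), so t = b * ln(1 / delta).
    """
    return 2.0 * math.log(1.0 / delta) / epsilon


rng = random.Random(0)
true_counts = [12, 0, 5, 0]  # hypothetical per-leaf counts, two leaves empty
print(noisy_leaf_counts(true_counts, epsilon=1.0, rng=rng))
for eps in (0.1, 1.0, 10.0):
    print(eps, noise_interval_width(eps))
```

The interval width scales as 1/ε, which matches the qualitative trend in Figure 5: a larger budget means less noise and a narrower range of plausible true counts for an attacker to search.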