Trained Random Forests Completely Reveal your Dataset

Julien Ferry; Ricardo Fukasawa; Timothée Pascal; Thibaut Vidal

TL;DR

This work investigates the privacy risk of releasing a trained random forest under white-box access. It formulates dataset reconstruction as a maximum-likelihood combinatorial problem, proves that the problem is NP-hard, and solves it in practice with a constraint programming (CP) framework. The attack uses only the forest structure and per-node counts exposed by standard libraries such as scikit-learn. It achieves complete or near-complete recovery of the training data when bagging is not used, and recovers most of the data even with bagging. Experiments on the COMPAS, Adult, and Default datasets show that deep, large forests leak nearly the entire training set, while bagging offers only partial protection. The study discusses the practical privacy implications, provides open-source tooling, and suggests future directions, including privacy-preserving mechanisms and extensions of the methodology to other model families and attribute types.
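To make the role of per-node counts concrete, the following is a minimal sketch of reconstruction by brute-force enumeration. It is not the authors' CP model and has no likelihood objective: it simply lists every binary dataset consistent with the per-leaf class counts of one tree. The toy data, the `leaf_counts` and `reconstruct` helpers, and the representation of a leaf as a `{feature: value}` path are all assumptions made for illustration.

```python
from itertools import product
from collections import Counter

# Hypothetical toy target: 4 examples, 2 binary features, binary labels.
X_true = [(0, 0), (0, 1), (1, 0), (1, 1)]
# The root's class counts tell the attacker how many examples each class has,
# so the label order can be fixed without loss of generality.
y = [0, 1, 1, 1]
N, M = len(X_true), 2

def leaf_counts(X, labels, leaves):
    """Return (class-0 count, class-1 count) for each leaf of a tree.

    A leaf is a dict {feature_index: required_value} describing its path.
    """
    out = []
    for path in leaves:
        c = Counter(lbl for row, lbl in zip(X, labels)
                    if all(row[f] == v for f, v in path.items()))
        out.append((c[0], c[1]))
    return out

def reconstruct(leaves):
    """Enumerate all datasets matching the counts the attacker observes."""
    observed = leaf_counts(X_true, y, leaves)  # the released information
    solutions = set()
    for bits in product((0, 1), repeat=N * M):
        X = [tuple(bits[i * M:(i + 1) * M]) for i in range(N)]
        if leaf_counts(X, y, leaves) == observed:
            # Canonical form: the order of examples carries no information.
            solutions.add(tuple(sorted(zip(X, y))))
    return solutions

# A stump on feature 0 only: feature 1 stays unconstrained.
stump = [{0: 0}, {0: 1}]
# A depth-2 tree splitting on feature 0, then on feature 1.
deep = [{0: a, 1: b} for a in (0, 1) for b in (0, 1)]

shallow_solutions = reconstruct(stump)
deep_solutions = reconstruct(deep)
print(len(shallow_solutions), len(deep_solutions))
```

In this toy case the depth-2 tree pins the dataset down uniquely (up to the order of examples), while the stump leaves several candidates. Enumeration grows as 2^(N·M), which is why a solver-based approach is needed beyond toy sizes.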

Abstract

We introduce an optimization-based reconstruction attack capable of completely or near-completely reconstructing a dataset utilized for training a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. To achieve this, we formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We demonstrate that this problem is NP-hard, though solvable at scale using constraint programming -- an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to a complete reconstruction. This holds true even with a small number of trees. Even with bootstrap aggregation, the majority of the data can also be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, warranting attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.
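As a rough illustration of the "information readily available" in scikit-learn, the sketch below trains a small forest without bagging and reads each tree's split features, child pointers, and per-node class counts from the fitted `tree_` objects. The synthetic data, the hyperparameters, and the `node_class_counts` helper are assumptions for illustration. Because some scikit-learn versions store class fractions rather than counts in `tree_.value`, the helper rescales when the root row sums to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data: 50 examples, 6 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 6))
y = (X[:, 0] ^ X[:, 1]).astype(int)

# No bagging (bootstrap=False), but feature randomization stays on.
forest = RandomForestClassifier(
    n_estimators=5, max_depth=3, bootstrap=False,
    max_features="sqrt", random_state=0,
).fit(X, y)

def node_class_counts(estimator):
    """Per-node class counts as integers, shape (n_nodes, n_classes)."""
    t = estimator.tree_
    value = t.value[:, 0, :]
    # Some versions store per-node class fractions instead of counts;
    # in that case rescale by the (weighted) number of samples per node.
    if np.isclose(value[0].sum(), 1.0) and t.weighted_n_node_samples[0] > 1:
        value = value * t.weighted_n_node_samples[:, None]
    return np.rint(value).astype(int)

# What a white-box attacker can read off each tree.
views = []
for est in forest.estimators_:
    t = est.tree_
    views.append({
        "feature": t.feature.copy(),       # split feature per node (-2 at leaves)
        "left": t.children_left.copy(),    # left child index (-1 at leaves)
        "right": t.children_right.copy(),  # right child index (-1 at leaves)
        "counts": node_class_counts(est),  # class counts per node
    })

print(views[0]["counts"][0])  # class counts at the root of the first tree
```

Without bagging, every tree is fitted on the full training set, so the root counts of each tree equal the class histogram of the data, and the counts at deeper nodes constrain the feature values further.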

Paper Structure (36 sections, 1 theorem, 14 equations, 6 figures, 8 tables)

Introduction
Technical Background
Supervised Machine Learning (ML).
Random Forests (RFs).
Training RFs.
Constraint Programming (CP).
Related Works
Illustrative Example
NP-Hardness Result
Constraint Programming Approach
Maximum log-likelihood objective.
Model simplifications when bagging is deactivated.
Reconstructing non-binary attributes.
Experimental Study
Experimental Setup
...and 21 more sections

Key Result

Theorem 5.1

The decision version of the dataset reconstruction problem is $\mathsf{NP}$-complete.

Figures (6)

Figure 1: Example decision trees trained using scikit-learn on a small dataset (Table \ref{tab:toy_dataset}).
Figure 2: Average reconstruction error as a function of the number of trees $\lvert \forest \rvert$ within the target forest $\forest$, for different maximum depth values $d_{max}$ and for the random baseline.
Figure 3: Example of an instance of the reconstruction problem originating from \ref{eq:3satex}. The left branches correspond to setting the feature to 0; the right branches set it to 1. The numbers below the leaves are the $\nodesupport[\clause,\node,\class]$ values.
Figure 4: Average reconstruction error as a function of the number of trees $\lvert \forest \rvert$ within the attacked forest $\forest$, for different maximum depth values $d_{max}$ and for the random baseline. For the experiments on the COMPAS dataset without bagging, we report the results obtained using either the CP model (Section \ref{sec:method_implementation}) or the MILP one (Section \ref{subsec:milp_formulation}).
Figure 5: Comparison of the benchmark results (using bagging; worst possible reconstruction error using our set of constraints if the number of occurrences of each example within each tree is known) with the "no-bagging" results.
...and 1 more figure

Theorems & Definitions (6)

Theorem 5.1
proof
Claim 1.1
proof
Claim 1.2
proof
