DynFrs: An Efficient Framework for Machine Unlearning in Random Forest

Shurong Wang, Zhuoyang Shen, Xinbao Qiao, Tongning Zhang, Meng Zhang

TL;DR

DynFrs introduces an efficient machine unlearning framework for Random Forests by coupling Occ(q) cross-tree subsampling with a lazy tag mechanism (Lzy) and using Extremely Randomized Trees (ERT) as the base learner. The approach achieves exact unlearning with favorable theoretical time bounds while delivering substantial practical speedups, especially in online and batch scenarios, and often preserves or improves predictive accuracy. Empirical results across nine binary datasets and a Higgs-scale online stream show orders-of-magnitude improvements over naïve retraining and competitive performance versus prior RF unlearning methods. The framework enables real-time, continual learning and unlearning in privacy-sensitive settings, with open-source reproducibility resources provided. The combination of subsampling, lazy updates, and robust ERTs enables fast, scalable unlearning suitable for dynamic data environments.

Abstract

Random Forests are widely recognized for their established efficacy in classification and regression tasks, standing out in various domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, as personal and confidential data are involved. With increasing demand for the right to be forgotten, particularly under regulations such as GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, insufficient attention has been paid to this topic, and existing approaches are difficult to apply in real-world scenarios. Addressing this gap, we propose the DynFrs framework, designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. DynFrs leverages the subsampling method Occ(q) and the lazy tag strategy Lzy, and remains adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs in only a proportion of the trees, so that the impact of deleting samples is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications to tree structures. In experiments, applying DynFrs to Extremely Randomized Trees yields substantial improvements, achieving orders of magnitude faster unlearning and better predictive accuracy than existing machine unlearning methods for Random Forests.
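
The subsampling idea behind Occ(q) can be illustrated with a minimal sketch. It assumes that each sample is placed in k = ceil(q * T) of the T trees, chosen uniformly at random; that rounding rule and the names `occ_assign` and `remove_sample` are illustrative assumptions, not taken from the paper. The point is only that a deletion request touches the k trees that hold the sample rather than all T.

```python
import math
import random


def occ_assign(n_samples, n_trees, q, seed=0):
    """Assign every sample to k distinct trees (assumed rule: k = ceil(q * n_trees)).

    Returns (tree_samples, sample_trees):
      tree_samples[t] lists the sample indices used to train tree t,
      sample_trees[i] lists the trees that contain sample i.
    """
    rng = random.Random(seed)
    k = math.ceil(q * n_trees)
    tree_samples = [[] for _ in range(n_trees)]
    sample_trees = []
    for i in range(n_samples):
        chosen = rng.sample(range(n_trees), k)  # k distinct tree indices
        sample_trees.append(chosen)
        for t in chosen:
            tree_samples[t].append(i)
    return tree_samples, sample_trees


def remove_sample(tree_samples, sample_trees, i):
    """Drop sample i and return the trees that must be updated."""
    touched = sample_trees[i]
    for t in touched:
        tree_samples[t].remove(i)
    sample_trees[i] = []
    return touched


if __name__ == "__main__":
    tree_samples, sample_trees = occ_assign(n_samples=100, n_trees=10, q=0.5)
    touched = remove_sample(tree_samples, sample_trees, 7)
    print(f"deleting sample 7 touches {len(touched)} of 10 trees")
```

Under this assumption, a smaller q means fewer trees to update per request, at the cost of each tree seeing less of the training set.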

Paper Structure

This paper contains 35 sections, 9 theorems, 7 equations, 6 figures, 15 tables, 3 algorithms.

Key Result

Theorem 1

Sample addition and removal for the DynFrs framework are exact.

Figures (6)

  • Figure 1: Left: (a) A sample addition/removal request arises. (b) The nodes it impacts lie on the blue path. (c) The best split changes in the purple node (but not in the other visited nodes), so a tag is placed on it. (d) The subtree of the tagged node is deleted. Middle: (a) A querying request arises. (b) The nodes that determine the prediction lie on the orange path. (c) The tag is pushed down recursively until the query reaches a leaf. Right: A detailed view of how a querying request grows the tree by splitting the tagged node and recursively pushing its tag down to its children.
  • Figure 2: DynFrs with different $q$ is tested on the datasets Adult and Diabetes; the curve shows the trend, and error bars show the standard deviation.
  • Figure 3: DynFrs's unlearning exactness. The blue-gray solid line represents the accuracy trend of the unlearned model, while the orange dotted line represents that of the retrained model.
  • Figure 4: Comparison of sequential unlearning boost ($\uparrow$) between DynFrs with different $q$ and DaRE; error bars represent the minimum and maximum values among five trials.
  • Figure 5: Comparison of the batch unlearning runtime ($\downarrow$) of DynFrs and two baseline methods under 4 different unlearning batch sizes.
  • ...and 1 more figures
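
The lazy tag process described in the Figure 1 caption can be sketched on a toy one-feature tree. Everything here is an illustrative assumption rather than the paper's implementation: the split rule (midpoint between the smallest and largest feature value), the choice to store raw samples in every node, and the names `Node`, `choose_split`, `expand`, `remove`, and `predict`. A removal walks down the affected path and tags the first node whose split changes, discarding its subtree; a query rebuilds only the tagged nodes it passes through.

```python
from dataclasses import dataclass
from typing import Optional


def choose_split(xs):
    """Toy deterministic split rule (assumption): midpoint of min and max, or None for a leaf."""
    if len(xs) < 2 or min(xs) == max(xs):
        return None
    return (min(xs) + max(xs)) / 2


@dataclass
class Node:
    samples: list                     # (x, y) pairs, with y in {0, 1}
    tagged: bool = True               # True: split and children not yet (re)built
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def expand(node):
    """Resolve a tag: recompute the split and create children that are themselves tagged."""
    node.threshold = choose_split([x for x, _ in node.samples])
    node.left = node.right = None
    if node.threshold is not None:
        node.left = Node([s for s in node.samples if s[0] <= node.threshold])
        node.right = Node([s for s in node.samples if s[0] > node.threshold])
    node.tagged = False


def remove(node, sample):
    """Delete one sample along its path; tag the first node whose split changes and stop."""
    while node is not None:
        node.samples.remove(sample)
        if node.tagged:
            return                    # subtree is not built, nothing else to fix
        if choose_split([x for x, _ in node.samples]) != node.threshold:
            node.tagged = True        # defer the rebuild until a query needs it
            node.left = node.right = None
            return
        if node.threshold is None:
            return                    # reached a leaf
        node = node.left if sample[0] <= node.threshold else node.right


def predict(node, x):
    """Answer a query, rebuilding only the tagged nodes on the query path."""
    while True:
        if node.tagged:
            expand(node)
        if node.threshold is None:
            labels = [y for _, y in node.samples]
            return int(2 * sum(labels) > len(labels))  # majority vote, ties go to 0
        node = node.left if x <= node.threshold else node.right


if __name__ == "__main__":
    data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1), (0.5, 0)]
    root = Node(list(data))
    predict(root, 2.5)                # the first query builds one root-to-leaf path
    remove(root, (5.0, 1))            # removing the maximum changes the root split
    print("root tagged after removal:", root.tagged)
```

Because the toy split rule is a deterministic function of the samples in a node, the lazily maintained tree gives the same predictions as one built from scratch on the remaining samples, which is the sense of exactness this sketch is meant to convey.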

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 1
  • proof
  • Lemma 1
  • proof
  • Theorem 2
  • proof
  • ...and 4 more