Differential Privacy Under Class Imbalance: Methods and Empirical Insights

Lucas Rosenblatt, Yuliia Lut, Eitan Turok, Marco Avella-Medina, Rachel Cummings

TL;DR

Private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.

Abstract

Imbalanced learning occurs in classification settings where the distribution of class labels is highly skewed in the training data, such as when predicting rare diseases or in fraud detection. This class imbalance presents a significant algorithmic challenge, which can be further exacerbated when privacy-preserving techniques such as differential privacy (DP) are applied to protect sensitive training data. Our work formalizes these challenges and provides a number of algorithmic solutions. We consider DP variants of pre-processing methods that privately augment the original dataset to reduce the class imbalance; these include oversampling, SMOTE, and private synthetic data generation. We also consider DP variants of in-processing techniques, which adjust the learning algorithm to account for the imbalance; these include model bagging, class-weighted empirical risk minimization (ERM), and class-weighted deep learning. For each method, we either adapt an existing imbalanced learning technique to the private setting or demonstrate its incompatibility with differential privacy. Finally, we empirically evaluate these privacy-preserving imbalanced learning methods under various data and distributional settings. We find that private synthetic data methods perform well as a data pre-processing step, while class-weighted ERMs are an alternative in higher-dimensional settings where private synthetic data suffers from the curse of dimensionality.

Paper Structure

This paper contains 47 sections, 16 theorems, 62 equations, 13 figures, 12 tables, 4 algorithms.

Key Result

Theorem 2

Let $\mathcal{M}_1$ be an algorithm that is $(\epsilon_1,\delta_1)$-DP, and let $\mathcal{M}_2$ be an algorithm that is $(\epsilon_2,\delta_2)$-DP. Then their composition $(\mathcal{M}_1,\mathcal{M}_2)$ is $(\epsilon_1 + \epsilon_2,\delta_1+\delta_2)$-DP.
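
Basic composition is what lets a private pre-processing step and a private learner share one overall budget. The following is a minimal sketch of that bookkeeping, not the paper's implementation: the `DPGuarantee` class and the `compose` and `split_budget` helpers are hypothetical names introduced here, and the even split of the budget is only one illustrative choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPGuarantee:
    """An (epsilon, delta)-DP guarantee. Hypothetical helper type."""
    epsilon: float
    delta: float = 0.0


def compose(*guarantees: DPGuarantee) -> DPGuarantee:
    """Basic composition: epsilons add and deltas add."""
    return DPGuarantee(
        epsilon=sum(g.epsilon for g in guarantees),
        delta=sum(g.delta for g in guarantees),
    )


def split_budget(total: DPGuarantee, k: int) -> list[DPGuarantee]:
    """Split a total budget evenly across k mechanisms (an assumption;
    uneven splits are equally valid under basic composition)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [DPGuarantee(total.epsilon / k, total.delta / k) for _ in range(k)]


# Illustrative pipeline: a private pre-processing step followed by a DP learner.
preprocess = DPGuarantee(epsilon=0.5, delta=0.0)
learner = DPGuarantee(epsilon=0.5, delta=1e-6)
pipeline = compose(preprocess, learner)
```

Here `pipeline` carries the guarantee of the two steps run on the same data; any further step that touches only the released outputs adds nothing to it, by the post-processing property.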

Figures (13)

  • Figure 1: Performance on the mammography dataset under varying $\epsilon$, shown for overall performance metrics (AUC, F1, Balanced Accuracy, Precision) and for metrics appropriate to imbalanced classification settings (Recall, Worst Class Accuracy, Macro Average Accuracy, Geometric Mean).
  • Figure 2: Top row shows decision boundaries of non-DP classifiers (high performance on the task, $\mathrm{AUC} \in [0.94, 0.97]$). Bottom row illustrates the decision boundaries of DP classifiers ($\epsilon=1.0$, $\delta=10^{-5}$ where applicable), which perform worse. The underlying true data generating function for each class is represented as an ellipse (dotted white line), where the center of the ellipse is the mean and each point on the dotted line represents 2 standard deviations from the mean.
  • Figure 3: SMOTE pre-processing on downstream DP logistic regression (with adjusted $\epsilon$) on the mammography dataset. Data was subsampled (log-scale x-axis: $n \in \{500, 1000, 2000, 5000, 10000\}$) and evaluated across imbalance ratios $r \in \{4, 8, 16, 32\}$.
  • Figure 4: F1 score performance on subsamples of the mammography dataset (imblearn) comparing differentially private logistic regression (DPLR) and DP bagging (DPLR as weak learner). Data was subsampled (log-scale x-axis: $n \in \{500, 1000, 2000, 5000, 10000\}$) and evaluated across imbalance ratios $r \in \{4, 8, 16, 32\}$.
  • Figure 5: Comparison of PrivBayes [zhang2017privbayes] and GEM [liu2021iterative] as private pre-processing steps on the mammography dataset, with XGBoost as the downstream non-private classifier. PrivBayes, while generally weaker, shows similar performance trends to GEM as $\epsilon$ increases and is a strong private pre-processing step for imbalanced classification.
  • ...and 8 more figures
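
The imbalance-oriented metrics named in Figure 1 can all be derived from per-class recall. The sketch below assumes the common definitions (worst class accuracy as the minimum per-class recall, macro average accuracy as their mean, geometric mean as their geometric mean); the paper's exact definitions are not given in this summary, and the function names are hypothetical.

```python
import math


def per_class_recall(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    recalls = {}
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls[c] = sum(1 for i in idx if y_pred[i] == c) / len(idx)
    return recalls


def imbalance_metrics(y_true, y_pred):
    """Metrics that weight every class equally, regardless of class size.
    Definitions are assumed standard ones, not taken from the paper."""
    values = list(per_class_recall(y_true, y_pred).values())
    return {
        "worst_class_accuracy": min(values),
        "macro_average_accuracy": sum(values) / len(values),
        "geometric_mean": math.prod(values) ** (1 / len(values)),
    }


# Toy data with imbalance ratio 4: eight majority and two minority examples.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one of the two minority examples is missed
metrics = imbalance_metrics(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy data plain accuracy is 0.9 while worst class accuracy is 0.5, which is the gap these metrics are meant to expose.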

Theorems & Definitions (28)

  • Definition 1: Differential Privacy [dwork2006calibrating]
  • Theorem 2: Basic Composition [dwork2006calibrating]
  • Theorem 3: Post-processing [dwork2006calibrating]
  • Proposition 4
  • Theorem 5
  • Lemma 5
  • Proposition 6
  • Proposition 6
  • Example 7
  • Proposition 7
  • ...and 18 more