Feature Selection from Differentially Private Correlations

Ryan Swope, Amol Khanna, Philip Doldo, Saptarshi Roy, Edward Raff

TL;DR

A correlations-based order statistic is used to select important features from a dataset, and the selection is privatized so that the results do not leak information about individual datapoints. This method significantly outperforms the established baseline for private feature selection on many datasets.

Abstract

Data scientists often seek to identify the most important features in high-dimensional datasets. This can be done through $L_1$-regularized regression, but this can become inefficient for very high-dimensional datasets. Additionally, high-dimensional regression can leak information about individual datapoints in a dataset. In this paper, we empirically evaluate the established baseline method for feature selection with differential privacy, the two-stage selection technique, and show that it is not stable under sparsity. This makes it perform poorly on real-world datasets, so we consider a different approach to private feature selection. We employ a correlations-based order statistic to choose important features from a dataset and privatize them to ensure that the results do not leak information about individual datapoints. We find that our method significantly outperforms the established baseline for private feature selection on many datasets.
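
To make the idea of a correlations-based order statistic concrete, the following is a minimal, non-private sketch of correlation screening: features are ranked by $\lvert \mathbf{x}_j^\top \mathbf{y} \rvert$ and the top $k$ are kept. The function name `top_k_by_correlation`, the standardization step, and the synthetic data are illustrative assumptions rather than the authors' implementation, and the privatization step is deliberately omitted because its details are not given in this summary.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Return indices of the k features with the largest |x_j^T y| (non-private)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumed preprocessing: center and scale so that x_j^T y is
    # proportional to the sample correlation between feature j and y.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    ys = (y - y.mean()) / (y.std() + 1e-12)
    scores = np.abs(Xs.T @ ys)       # one score per feature
    order = np.argsort(-scores)      # features sorted by descending score
    return order[:k], scores


# Hypothetical synthetic data: only the first k features carry signal.
rng = np.random.default_rng(0)
n, d, k = 500, 50, 5
X = rng.normal(size=(n, d))
beta = np.zeros(d)
beta[:k] = 3.0
y = X @ beta + rng.normal(size=n)

selected, scores = top_k_by_correlation(X, y, k)
print(sorted(selected.tolist()))
```

A private variant would have to randomize this selection so that changing one datapoint has a bounded effect on the output; the sketch above only shows the ranking that such a mechanism would privatize.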

Paper Structure

This paper contains 16 sections, 2 theorems, 11 equations, 8 figures, 1 table, 2 algorithms.

Key Result

theorem 1

Let $\mathbf{X}$ and $\mathbf{y}$ be the design matrix and target vector after transformations in steps 1 and 2. Then $\lvert (\mathbf{X}^\top\mathbf{y})_i \rvert = \lvert \mathbf{x}_{(i)}^{\top}\mathbf{y} \rvert$. Denote $\left[ \lvert \mathbf{X}^\top\mathbf{y} \rvert \right]_j$ to be the $j^\text{th}$ [...] where $c_{d, k} = {d \choose k} / \frac{d^k}{k^k} \leq k$ and $\gamma$ is a hyperparameter of canon [...]

Figures (8)

  • Figure 1: Occurrences of selecting feature $n$ in the two-stage and SIS algorithms. An ideal result for both experiments would select each of features 1 through 5 1000 times and all other features zero times. SIS is closer to the ideal result than two-stage.
  • Figure 2: Top-$k$ accuracy of models fit on features selected from the Christensen dataset. DP-SIS outperforms the two-stage mechanism for $\epsilon$ values between $10^0$ and $10^1$, which are commonly used for private computation [near2023guidelines].
  • Figure 3: Top-$k$ accuracy of models fit on features selected from the Sorlie dataset. DP-SIS outperforms the two-stage mechanism for $\epsilon$ values greater than $10^0$, which are commonly used for private computation [near2023guidelines].
  • Figure 4: Top-$k$ accuracy of models fit on features selected from the Yeoh dataset. DP-SIS outperforms the two-stage mechanism for $\epsilon$ values greater than $2 \times 10^0$, which can be used in private computation [near2023guidelines].
  • Figure 5: Top-$k$ accuracy of models fit on features selected from the Synth dataset. DP-SIS outperforms the two-stage mechanism for $\epsilon$ values greater than $10^1$. Although such high $\epsilon$ values are typically not used for private computation, this result still shows that DP-SIS performs better than the two-stage baseline.
  • ...and 3 more figures

Theorems & Definitions (4)

  • theorem 1
  • definition 1: Lipschitz Mechanism, $\kappa=1$
  • definition 2: Utility Class
  • theorem 2