CURATE: Scaling-up Differentially Private Causal Graph Discovery

Payel Bhattacharjee, Ravi Tandon

TL;DR

CURATE (CaUsal gRaph AdapTivE privacy) is a DP-CGD framework with adaptive privacy budgeting: it allocates the privacy budget by minimizing error probability (constraint-based) or maximizing the number of iterations of the optimization problem (score-based), while keeping the cumulative leakage bounded.

Abstract

Causal Graph Discovery (CGD) is the process of estimating the underlying probabilistic graphical model that represents the joint distribution of the features of a dataset. CGD algorithms are broadly classified into two categories: (i) constraint-based algorithms, whose outcome depends on conditional independence (CI) tests, and (ii) score-based algorithms, whose outcome depends on an optimized score function. Since sensitive features of observational data are prone to privacy leakage, Differential Privacy (DP) has been adopted to ensure user privacy in CGD. Adding the same amount of noise at every step of this sequential estimation process degrades the predictive performance of the algorithms. Because the initial CI tests in constraint-based algorithms and the later iterations of the optimization process in score-based algorithms are crucial, they need to be more accurate and less noisy. Based on this key observation, we present CURATE (CaUsal gRaph AdapTivE privacy), a DP-CGD framework with adaptive privacy budgeting. In contrast to existing DP-CGD algorithms, which use uniform privacy budgeting across all iterations, CURATE allocates the privacy budget adaptively by minimizing error probability (for constraint-based algorithms) and maximizing the number of iterations of the optimization problem (for score-based algorithms), while keeping the cumulative leakage bounded. To validate our framework, we present a comprehensive set of experiments on several datasets and show that CURATE achieves higher utility than existing DP-CGD algorithms with less privacy leakage.

Paper Structure (11 sections, 2 theorems, 29 equations, 7 figures, 2 algorithms)

Key Result

Lemma 1

For some $c_1, c_2 \in (0,1)$, and non-negative test threshold margins $(\beta_1,\beta_2)$, the relative Type-I ($\mathbb{P}[E_1^i]$) and Type-II ($\mathbb{P}[E_2^i]$) errors in order-$(i)$ CI tests in CURATE with privacy budget $\epsilon_i$ and $l_1$-sensitivity $\Delta$ can be bounded as:

Figures (7)

  • Figure 1: The generic workflow of constraint-based CGD algorithms with two phases: Skeleton Phase and Orientation Phase. The skeleton phase starts with a fully connected graph with $d$ nodes, where $d$ is the number of features/variables. $k_i$ is the maximum number of CI tests in order $i$. The sequence and number of tests in any order $i$ are dependent on the outcomes of order $(i-1)$ tests, and the skeleton phase is prone to privacy leakage.
  • Figure 2: The composition mechanism in constraint-based CURATE across all orders of CI tests. For every order $i$, the total privacy leakage is calculated with Advanced Composition, since the privacy budget and failure probability are the same for all order-$(i)$ tests. The total leakage across all orders is then calculated with Basic Composition.
  • Figure 3: Possible number of iterations ($I$) given a total privacy budget ($\epsilon_{\text{Total}}$) and an initial privacy budget ($\epsilon_0$). Across total privacy budgets ($\epsilon_{\text{Total}}=0.1$, $\epsilon_{\text{Total}}=1.0$, $\epsilon_{\text{Total}}=10.0$) and initial budgets ($\epsilon_0 \ll 1.0$ and $\epsilon_0>1.0$), the multiplicative method executes more iterations in the high-privacy regime (i.e., $\epsilon_0 \ll 1.0$).
  • Figure 4: Part (a) shows the performance of the differentially private CGD algorithms EM-PC [xu_differential_2017], SVT-PC, Priv-PC [wang_towards_2020], NOLEAKS [ma_noleaks_2022], and CURATE (score-based and constraint-based) in terms of total leakage vs. F1 score on six public CGD datasets: Cancer, Earthquake, Survey, Asia, Sachs, and Child. Part (b) presents the mean and standard deviation of the F1 score over 50 consecutive runs in three privacy regimes ($\epsilon_{\text{Total}} = 0.1$, $\epsilon_{\text{Total}}=5.0$, $\epsilon_{\text{Total}}=10.0$).
  • Figure 5: Average number of CI tests required to achieve the maximum F1 score with a comparatively large total leakage ($\epsilon_{\text{Total}}=1.0$) on the Cancer, Earthquake, Survey, Asia, Sachs, and Child datasets. The average number of CI tests in CURATE converges to that of the non-private PC algorithm, whereas EM-PC [zheng_dags_2018], Priv-PC, and SVT-PC [wang_towards_2020] tend to run more CI tests.
  • ...and 2 more figures
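
The two-level accounting described in the Figure 2 caption (Advanced Composition within an order, Basic Composition across orders) can be sketched as follows. This is a minimal illustration that assumes the standard advanced composition bound for $k$ adaptively composed $(\epsilon,\delta)$-DP mechanisms; it is not taken from the paper, whose exact accounting may differ. The function names, the slack parameter `delta_prime`, and the per-order budgets and test counts are all hypothetical.

```python
import math


def advanced_composition(eps, delta, k, delta_prime):
    """Leakage of k adaptively composed (eps, delta)-DP tests.

    Uses the standard advanced composition bound (an assumption here,
    not the paper's stated formula):
      eps_total   = sqrt(2 k ln(1/delta')) * eps + k * eps * (e^eps - 1)
      delta_total = k * delta + delta'
    """
    eps_total = (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
                 + k * eps * (math.exp(eps) - 1))
    delta_total = k * delta + delta_prime
    return eps_total, delta_total


def total_leakage(orders, delta_prime):
    """Basic composition (plain sums) over the per-order leakages.

    `orders` is a list of (eps_i, delta_i, k_i) tuples, one per order i,
    where k_i is the number of CI tests run in that order.
    """
    per_order = [advanced_composition(e, d, k, delta_prime)
                 for e, d, k in orders]
    return sum(e for e, _ in per_order), sum(d for _, d in per_order)


# Hypothetical per-order budgets and test counts, for illustration only.
orders = [(0.5, 1e-6, 10), (0.2, 1e-6, 30), (0.1, 1e-6, 60)]
eps_total, delta_total = total_leakage(orders, delta_prime=1e-6)
print(f"total epsilon = {eps_total:.3f}, total delta = {delta_total:.2e}")
```

Treating each order as one block with a shared budget is what makes the within-order bound applicable, since that bound is stated for mechanisms with identical $(\epsilon,\delta)$; the blocks have different budgets, so they are simply summed.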

Theorems & Definitions (7)

  • Definition 1: Probabilistic Graphical Model
  • Definition 2: Causal Graph Discovery
  • Definition 3: ($\epsilon,\delta$)-Differential Privacy
  • Definition 4: $l_k$-sensitivity
  • Definition 5: Analytic Gaussian Mechanism [balle_privacy_2018]
  • Lemma 1
  • Lemma 2
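
For orientation, Definitions 3 and 4 are usually stated as below. These are the standard textbook forms, given here as an assumption; the paper's own wording is not reproduced in this summary and may differ. The symbols $\mathcal{M}$ (randomized mechanism), $D, D'$ (neighboring datasets, assumed to differ in one record), $S$ (a set of outputs), and $f$ (the query) are introduced here for illustration.

```latex
% (epsilon, delta)-Differential Privacy, standard form:
% for all neighboring datasets D, D' and all measurable output sets S
\mathbb{P}[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\, \mathbb{P}[\mathcal{M}(D') \in S] + \delta

% l_k-sensitivity of a query f, standard form:
\Delta_k(f) \;=\; \max_{D \sim D'} \, \lVert f(D) - f(D') \rVert_k
```

With $k=1$ the second expression gives the $l_1$-sensitivity $\Delta$ that appears in Lemma 1.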