Differential Privacy for Network Connectedness Indices

Tom A. Rutter, Yuxin Liu, M. Amin Rahimian

Abstract

Researchers increasingly use data on social and economic networks to study a range of social science questions, but releasing statistics derived from networks can raise significant privacy concerns. We show how to release network connectedness indices that quantify assortative mixing across node attributes under edge-adjacent differential privacy. Standard privacy techniques perform poorly in this setting both because connectedness indices have high global sensitivity and because a single node's attribute can potentially be an input to connectedness in thousands of cells, leading to poor composition. Our method, which is straightforward to apply, first adds noise to node attributes, then analytically debiases downstream statistics, and finally applies a second layer of noise to protect the presence or absence of individual edges. We prove consistency and asymptotic normality of our estimators for both discrete and continuous labels and show our method works well in simulations and on real networks with as few as 200 nodes collected by social scientists.
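The three stages named in the abstract (noise on node attributes, analytic debiasing, noise protecting edges) can be illustrated with a minimal sketch. Everything in it is an assumption chosen for illustration and is not taken from the paper: labels are binary and perturbed by randomized response, the released statistic is a 2x2 table of ordered endpoint-pair counts by type rather than the connectedness index itself, edge protection uses Laplace noise, and the function names `rr_matrix`, `ordered_pair_counts`, and `private_pair_counts` are hypothetical. For simplicity the sketch adds the edge noise before the linear debiasing step, so debiasing is pure post-processing; the abstract describes the opposite order.

```python
import numpy as np

def rr_matrix(eps_l):
    """Randomized-response channel: Q[s, a] = P(noisy label = a | true label = s)."""
    p_keep = np.exp(eps_l) / (1.0 + np.exp(eps_l))
    return np.array([[p_keep, 1.0 - p_keep], [1.0 - p_keep, p_keep]])

def ordered_pair_counts(edges, labels):
    """Count ordered endpoint pairs by type; each undirected edge adds (a, b) and (b, a)."""
    counts = np.zeros((2, 2))
    a, b = labels[edges[:, 0]], labels[edges[:, 1]]
    np.add.at(counts, (a, b), 1.0)
    np.add.at(counts, (b, a), 1.0)
    return counts

def private_pair_counts(edges, labels, eps_l, eps_e, rng):
    """Illustrative two-layer release (assumed design, requires eps_l > 0 and eps_e > 0)."""
    q = rr_matrix(eps_l)
    # Layer 1: flip each node's binary label independently with probability 1 - p_keep.
    flip = rng.random(labels.shape[0]) > q[0, 0]
    noisy_labels = np.where(flip, 1 - labels, labels)
    noisy_counts = ordered_pair_counts(edges, noisy_labels)
    # Layer 2: adding or removing one edge changes the table by 2 in L1 norm,
    # so Laplace noise with scale 2 / eps_e is used on every cell.
    noisy_counts = noisy_counts + rng.laplace(scale=2.0 / eps_e, size=(2, 2))
    # Debias: E[noisy table] = Q^T C Q when endpoints are distinct nodes,
    # and Q is symmetric here, so C is estimated by Q^{-1} N Q^{-1}.
    q_inv = np.linalg.inv(q)
    return q_inv @ noisy_counts @ q_inv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=200)
    pairs = [(i, j) for i in range(200) for j in range(i + 1, 200) if rng.random() < 0.05]
    edges = np.array(pairs)
    print(ordered_pair_counts(edges, labels))
    print(private_pair_counts(edges, labels, eps_l=1.0, eps_e=1.0, rng=rng))
```

The debiasing step is unbiased for the true table because the label flips are independent across the two endpoints of an edge, but it inflates variance as `eps_l` shrinks, since `Q` approaches a singular matrix.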

Paper Structure (34 sections, 14 theorems, 274 equations, 11 figures, 3 tables, 3 algorithms)

This paper contains 34 sections, 14 theorems, 274 equations, 11 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

Let $(\mathcal{V},\mathcal{E},\mathbf{L})$ and $(\mathcal{V},\mathcal{E}',\mathbf{L}')$ be edge-adjacent labeled networks. Let $\mathcal{M}_1 : (\mathcal{V},\mathcal{E},\mathbf{L}) \mapsto (\mathcal{V},\mathcal{E},\widehat{\mathbf{L}})$ be $(\varepsilon_\ell,\delta_\ell)$-DP with respect to changing

Figures (11)

  • Figure 1: An example of how to calculate the cross-type connectedness index.
  • Figure 2: A star-network illustration of why node-level privacy notions can be too strong for connectedness statistics: modifying the center node (or protecting its presence/incident edges under node-DP) can move red-to-blue connectedness from $0$ to $1$, implying sensitivity that spans $[0,1]$ and does not diminish with network size.
  • Figure 3: Mean squared error vs homophily by privacy. Panel (a): This figure illustrates the impact of homophily on the accuracy of the differentially private mechanism across four privacy budgets ($\varepsilon \in \{0.5, 1, 2, 4\}$). In each case we split our privacy budget equally between $\varepsilon_e$ and $\varepsilon_l$. Our simulations are based on networks of 5000 nodes generated by a stochastic block model with two equally sized groups. To isolate the effect of community structure from network density, the total average degree is held constant at $\approx 80$. This is achieved by sweeping the within-group connection probability from 0.04 to 0.08 while simultaneously decreasing the between-group probability from 0.04 to 0 such that the sum of the within-group connection probability and the between-group connection probability is 0.08. Results are averaged over 1,125 simulation samples per data point (75 fixed graphs $\times$ 15 coupled noise seeds). The vertical axis utilizes a log scale. Panel (b): This figure plots the MSE of our differentially private mechanism as a function of the privacy budget ($\varepsilon$) for three simulated network scenarios: No Homophily, Low Homophily, and High Homophily. Accuracy is calculated based on 500 samples per epsilon value, utilizing 50 pre-generated fixed graphs and 10 coupled privacy noise processes per graph to isolate the privacy-induced variance. Shaded regions represent 95% confidence intervals. The privacy budget is split equally between $\varepsilon_e$ and $\varepsilon_l$. Each simulated network consists of $N=2,000$ nodes. We simulate networks using a stochastic block model with two equally sized groups. For the no homophily case, we set the connection probability to 0.04 both within and across groups. For the low homophily case, we set the within-group connection probability to be 0.06, and the between-group connection probability to be 0.02. For the high homophily case, we set the within-group connection probability to be 0.08 and the between-group connection probability to be 0.
  • Figure 4: MSE holding $\varepsilon_e + \varepsilon_l$ constant.
  • Figure 5: Simulations for continuous labels. The first panel plots the $R^2$ from regressions of average friend rank on own rank in networks generated by sampling node labels uniformly from $[0,1]$ and filling edges with probability $\frac{20}{99{,}999\left(\frac{2}{h}-\frac{2}{h^2}(1-e^{-h})\right)}e^{-h|x_i-x_j|}$, for varying levels of $h$. Each simulated network contains 100,000 nodes and has average degree 20. The second panel shows results for four network sizes with homophily $h=0.8$ and expected degree $\bar{d}=20$, averaged over 3,000 simulations per point.
  • ...and 6 more figures
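
As a check on the edge probability quoted in the Figure 5 caption, the factor $\frac{2}{h}-\frac{2}{h^2}(1-e^{-h})$ is the mean of $e^{-h|x_i-x_j|}$ for independent uniform labels on $[0,1]$, which is why the stated average degree comes out to 20. The worked steps below are a reader's derivation, not text from the paper; the symbols $c(h)$ and $t$ are introduced here for illustration, and the only outside fact used is that $|x_i-x_j|$ has density $2(1-t)$ on $[0,1]$.

$$
\begin{aligned}
c(h) &:= \mathbb{E}\left[e^{-h|x_i-x_j|}\right] = \int_0^1 2(1-t)\,e^{-ht}\,dt \\
&= 2\cdot\frac{1-e^{-h}}{h} \;-\; 2\left(-\frac{e^{-h}}{h}+\frac{1-e^{-h}}{h^2}\right) \\
&= \frac{2}{h}-\frac{2}{h^2}\left(1-e^{-h}\right), \\
\mathbb{E}[\text{degree}] &= 99{,}999\cdot\frac{20}{99{,}999\,c(h)}\cdot c(h) = 20 .
\end{aligned}
$$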

Theorems & Definitions (35)

  • Definition 1: $\varepsilon$-differential privacy
  • Definition 2: Labeled Network
  • Definition 3: Cross-Type Connectedness Index
  • Definition 4: Same-Type Connectedness Index
  • Definition 5: Edge-Adjacent Labeled Networks
  • Theorem 1: $(\varepsilon,\delta)$ composition under edge-adjacency
  • proof
  • Proposition 1: MVUE for individual connectedness
  • Lemma 1: Edge-sensitivity of $S_{1,n}$
  • Theorem 2: Consistency of the debiased private estimator
  • ...and 25 more