Privacy-Preserving Community Detection for Locally Distributed Multiple Networks

Xiao Guo, Xiang Li, Xiangyu Chang, Shujie Ma

TL;DR

The paper develops privacy-preserving Distributed Spectral Clustering (ppDSC), a novel algorithm that perturbs the network edges with the randomized response (RR) mechanism and thereby satisfies the strong notion of differential privacy.

Abstract

Modern multi-layer networks are commonly stored and analyzed in a local and distributed fashion because of privacy, ownership, and communication costs. The literature on model-based statistical methods for community detection with such data is still limited. This paper proposes a new method for consensus community detection and estimation in a multi-layer stochastic block model using locally stored and computed network data with privacy protection. A novel algorithm named privacy-preserving Distributed Spectral Clustering (ppDSC) is developed. To preserve the edges' privacy, we adopt the randomized response (RR) mechanism to perturb the network edges, which satisfies the strong notion of differential privacy. The ppDSC algorithm is performed on the squared RR-perturbed adjacency matrices to prevent possible cancellation of communities among different layers. To remove the bias incurred by RR and by squaring the network matrices, we develop a two-step bias-adjustment procedure. We then perform eigen-decomposition on the debiased matrices, aggregate the local eigenvectors using an orthogonal Procrustes transformation, and apply k-means clustering. We provide a theoretical analysis of the statistical errors of ppDSC in terms of eigenvector estimation. In addition, the blessings and curses of network heterogeneity are well explained by our bounds.
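
The steps listed in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. All function names are hypothetical. The RR step is assumed to keep an existing edge with probability q and a non-edge with probability q' (the two parameters named in Figure 1). The first bias adjustment inverts the identity E[Ã_ij] = (q + q' - 1) A_ij + (1 - q'), which follows from that flipping rule. The second adjustment simply zeroes the diagonal of the squared matrix, a simplification standing in for the paper's own correction. Each machine is assumed to average its debiased layers before the eigen-decomposition, the first machine's eigenvectors serve as the Procrustes reference, and a plain Lloyd k-means is used in place of a library routine.

```python
import numpy as np


def randomized_response(A, q, q_prime, rng):
    """Perturb a symmetric 0/1 adjacency matrix with zero diagonal.

    Assumed rule: an edge (1) is kept with probability q, a non-edge (0)
    with probability q_prime. Only the upper triangle is sampled.
    """
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)
    a = A[iu]
    u = rng.random(a.shape[0])
    keep = np.where(a == 1, u < q, u < q_prime)
    A_tilde = np.zeros((n, n))
    A_tilde[iu] = np.where(keep, a, 1 - a)
    return A_tilde + A_tilde.T


def debias(A_tilde, q, q_prime):
    """Two bias adjustments; the second one is a simplifying assumption."""
    n = A_tilde.shape[0]
    off_diag = 1.0 - np.eye(n)
    # Step 1: invert E[A_tilde_ij] = (q + q' - 1) * A_ij + (1 - q').
    A_bar = (A_tilde - (1.0 - q_prime) * off_diag) / (q + q_prime - 1.0)
    # Step 2: squaring inflates the diagonal; here it is simply zeroed out.
    M = A_bar @ A_bar
    np.fill_diagonal(M, 0.0)
    return M


def top_eigvecs(M, K):
    """Eigenvectors belonging to the K largest eigenvalues of symmetric M."""
    _, vecs = np.linalg.eigh(M)  # eigenvalues come back in ascending order
    return vecs[:, -K:]


def procrustes_average(V_list):
    """Rotate every local basis onto the first one, average, re-orthonormalize."""
    V_ref = V_list[0]
    total = np.zeros_like(V_ref)
    for V in V_list:
        # Orthogonal Procrustes: argmin_Z ||V Z - V_ref||_F equals U @ Wt.
        U, _, Wt = np.linalg.svd(V.T @ V_ref)
        total += V @ (U @ Wt)
    Q, _ = np.linalg.qr(total / len(V_list))
    return Q


def kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def ppdsc_sketch(layers, K, m, q, q_prime, seed=0):
    """Perturb, debias, decompose locally on m machines, aggregate, cluster."""
    rng = np.random.default_rng(seed)
    M_list = [debias(randomized_response(A, q, q_prime, rng), q, q_prime)
              for A in layers]
    chunks = np.array_split(np.arange(len(M_list)), m)
    V_list = [top_eigvecs(sum(M_list[i] for i in c) / len(c), K)
              for c in chunks]
    return kmeans(procrustes_average(V_list), K)
```

Calling `ppdsc_sketch(layers, K=2, m=2, q=0.9, q_prime=0.9)` on a list of symmetric 0/1 NumPy arrays returns one community label per node. The sketch requires q + q' ≠ 1 (otherwise the first adjustment divides by zero) and at least as many layers as machines.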

Paper Structure (42 sections, 13 theorems, 80 equations, 9 figures, 1 algorithm)


Key Result

Proposition 1

The randomized response mechanism satisfies $\epsilon$-edge-DP with

Figures (9)

  • Figure 1: The feasible region (in blue) of $q$ and $q'$ under a given privacy budget $\epsilon$.
  • Figure 2: Comparison of ppDSC, ppDSC-1b, ppDSC-2b, ppSC and Oracle on the simulated data. The effects of the number of networks $L$, the number of nodes $n$, and the number of local machines $m$ are shown, respectively. The projection distance and misclassification rate are evaluated.
  • Figure 3: The clustering performance of ppDSC with a pre-specified $\epsilon=1$ and $q,q'$ varying in the feasible region of DP (see Figure 1). The empirically and theoretically best combinations of $q,q'$ are marked with a red star and a pink cross, respectively.
  • Figure 4: Comparison of ppDSC, ppDSC-1b, ppDSC-2b and Oracle on the AUCS network. The effects of the privacy parameters $q,q'$ ($q,q'$ vary synchronously, $q'$ fixed but $q$ varies, $q$ fixed but $q'$ varies) and of the number of networks $L$ are shown, respectively. The misclassification rate with respect to the research group labels is evaluated.
  • Figure C.1: Illustration for the negative effect of heterogeneity (Model I). (a) corresponds to population parameters including the heterogeneity, eigen-gap and their ratios against $\alpha$. (b) and (c) correspond to the projection distance and misclassification rate of ppDSC, ppDSC-1b, ppDSC-2b, ppSC and Oracle.
  • ...and 4 more figures

Theorems & Definitions (26)

  • Definition 1: Edge-DP
  • Remark 1
  • Proposition 1 (Karwa et al., 2017)
  • Theorem 1
  • Definition 2: Heterogeneity
  • Remark 2
  • Theorem 2: Error Decomposition
  • Remark 3
  • Remark 4
  • Theorem 3
  • ...and 16 more