Differentially Private Linear Regression with Linked Data

Shurong Lin, Elliot Paquette, Eric D. Kolaczyk

TL;DR

This work addresses linear regression on data produced by record linkage (RL) under differential privacy (DP), explicitly incorporating linkage uncertainty through a matching probability matrix $Q$. It introduces two post-RL methods, Post-RL Noisy Gradient Descent (NGD) and Post-RL Sufficient Statistics Perturbation (SSP), which inject DP noise while accounting for the RL process and come with privacy guarantees and finite-sample error analyses. The authors derive explicit privacy budgets and error bounds, characterize the variances of the estimators, and validate the approaches through simulations and synthetic-data experiments, illustrating the privacy-accuracy tradeoffs induced by linkage error. The results enable practical privacy-preserving regression on linked data and guide the allocation of privacy budget against accuracy when linkage uncertainty is present.

Abstract

There has been increasing demand for establishing privacy-preserving methodologies for modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is a rising tool offering robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, with nontrivial upstream pre-processing typically not incorporated. An important example is when record linkage is done prior to downstream modeling. Record linkage refers to the statistical task of linking two or more data sets of the same group of entities without a unique identifier. This probabilistic procedure brings additional uncertainty to the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data. In particular, we propose a noisy gradient method and a sufficient statistics perturbation approach for the estimation of regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.
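
To make the role of the matching probability matrix $Q$ concrete, the following is a minimal non-private sketch. It assumes the linkage-adjusted least-squares form usually attributed to Lahiri and Larsen (2005), in which the linked response $\bm z$ is regressed on $W = QX$; this page does not state that formula, so treat it as an assumption. The function name `linkage_adjusted_ols` and the exchangeable linkage model (correct match with probability 0.8) are hypothetical choices for illustration, not the paper's private algorithms.

```python
import numpy as np

def linkage_adjusted_ols(X, z, Q):
    """Least squares of z on W = Q X (assumed form of a linkage-adjusted estimator)."""
    W = Q @ X
    beta_hat, *_ = np.linalg.lstsq(W, z, rcond=None)
    return beta_hat

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
beta = np.array([1.0, -2.0, 0.5])

# Hypothetical linkage model: record i is matched correctly with probability p,
# and to each of the other n - 1 records with equal probability otherwise.
p = 0.8
Q = np.full((n, n), (1 - p) / (n - 1))
np.fill_diagonal(Q, p)

# Noise-free check: if z equals its assumed mean Q X beta, regressing on W = Q X
# recovers beta, whereas regressing on X directly does not.
z_mean = Q @ X @ beta
beta_adjusted = linkage_adjusted_ols(X, z_mean, Q)
beta_naive, *_ = np.linalg.lstsq(X, z_mean, rcond=None)
print(beta_adjusted, beta_naive)
```

With $Q$ equal to the identity (no linkage uncertainty), the function reduces to ordinary least squares of $\bm z$ on $X$.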

Paper Structure (22 sections, 21 theorems, 91 equations, 6 figures, 2 algorithms)

Key Result

Lemma 2.1

Under the model described by (model:lm) and (model:z), we have for $i,j = 1, \dots, n$

Figures (6)

  • Figure 1: Pipeline of private regression with linked data.
  • Figure 2: A toy example of record linkage with mismatches (dashed links). The true dataset $(X, \bm y)$ is $\{(1,2),(2,4),(3,6),(4,8)\}$, yielding a slope estimate $\hat{\beta}_1=2$, while the linked set $(X, \bm z)$ is given by $\{(1,8),(2,4),(3,6),(4,2)\}$, yielding $\hat{\beta}_1=-1.6$.
  • Figure 3: Average $\ell_2$-error and variance (theoretical versus empirical), with $(\epsilon, \delta) = (1, 8.5\times 10^{-5})$, against $n$ and $\sigma$, respectively. The "RL-NGD" and "RL-SSP" algorithms are our proposed post-RL approaches applied to the linked data, compared with the non-RL "NGD" and "SSP" methods applied to $(X, \bm y)$ (i.e., with no linkage errors). The non-private "OLS" and "RL-OLS" (lahiri2005) results are also plotted for benchmarking. The number of iterations for "RL-NGD" falls within the range $(210, 260)$.
  • Figure 4: Data synthesis. The Ferlb dataset provides quasi-identifiers $(\Phi_{A}, \Phi_{B})$, and the SHIW dataset provides regression variables $(X, \bm y)$.
  • Figure 5: Boxplots of DP estimates based on 1000 repetitions with $(\epsilon, \delta) = (1, 8.5\times 10^{-5})$. The red dashed line indicates the OLS estimate. The proposed post-RL algorithms are compared with the non-RL "NGD" and "SSP" methods applied to $(X, \bm z)$ (i.e., without accounting for the linkage errors present). The third and fourth columns represent the two NGD methods running for $T = \lceil L^2\ln(c_0^2n)/3 \rceil$ iterations.
  • ...and 1 more figure
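
The two slope estimates quoted in the Figure 2 caption can be checked with a few lines of NumPy. This is a small verification sketch; the helper name `ols_slope` is hypothetical, and it assumes a simple linear fit with an intercept.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of the simple least-squares fit of y on x (with intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

x = [1, 2, 3, 4]
y_true = [2, 4, 6, 8]    # correctly matched responses
z_linked = [8, 4, 6, 2]  # responses of the first and last records swapped by the linkage

slope_true = ols_slope(x, y_true)      # approximately 2
slope_linked = ols_slope(x, z_linked)  # approximately -1.6
print(slope_true, slope_linked)
```

Two mismatched records out of four are enough to flip the sign of the fitted slope, which is why the linkage error has to be modeled rather than ignored.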

Theorems & Definitions (43)

  • Lemma 2.1: Theorem A.1, lahiri2005
  • Definition 1: $(\epsilon,\delta)$-DP, DworkR14
  • Proposition 2.1: Basic composition, DworkR14
  • Proposition 2.2: Post-processing, DworkR14
  • Definition 2: $\ell_2$-sensitivity
  • Lemma 2.2: Gaussian mechanism, DworkR14
  • Lemma 2.3: Better composition for $(\epsilon,\delta)$-DP via zCDP
  • Remark 3.1
  • Theorem 4.1: Privacy Guarantees
  • Lemma 4.2
  • ...and 33 more
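
As an illustration of the Gaussian mechanism listed above (Lemma 2.2), here is a minimal sketch using the classical calibration from Dwork and Roth, $\sigma = \Delta \sqrt{2\ln(1.25/\delta)}/\epsilon$, which is valid for $0 < \epsilon < 1$. The exact constants in the paper's statement are not shown on this page and may differ, so this calibration is an assumption. The function names `gaussian_sigma` and `gaussian_mechanism` and the example sensitivity value are hypothetical.

```python
import math
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Noise scale of the classical Gaussian mechanism (requires 0 < eps < 1)."""
    if not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("classical calibration needs 0 < eps < 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Release `value` plus i.i.d. Gaussian noise scaled to its l2-sensitivity."""
    value = np.asarray(value, dtype=float)
    sigma = gaussian_sigma(sensitivity, eps, delta)
    return value + rng.normal(loc=0.0, scale=sigma, size=value.shape)

# Example: privatize a 3-dimensional statistic whose l2-sensitivity is assumed to be 0.02.
rng = np.random.default_rng(0)
noisy = gaussian_mechanism(np.zeros(3), sensitivity=0.02, eps=0.5, delta=1e-5, rng=rng)
print(gaussian_sigma(0.02, 0.5, 1e-5), noisy)
```

The noise scale grows linearly with the $\ell_2$-sensitivity (Definition 2), so bounding the sensitivity of the released statistic is the step that determines how much accuracy is lost to privacy.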