PRIMO: Private Regression in Multiple Outcomes

Seth Neel

TL;DR

PRIMO introduces private regression in multiple outcomes, formalizing the joint objective $\min_W \|XW-Y\|_F^2$ under differential privacy. It develops two algorithmic families: ReuseCovGauss (Full DP) and ReuseCovProj (projection-based), achieving favorable privacy-utility tradeoffs across regimes and, in some cases, eliminating explicit dependence on the number of outcomes $l$ for large $l$. The methods exploit shared covariances across regressions by reusing noisy $X^{T}X$ and, in projection-based variants, privately releasing $X^{T}Y$ with improved $l$-scaling under feature or label DP. Empirical results on genomic datasets (1KG, dbGaP) show projection-based PRIMO can outperform naive baselines and remains effective for very large numbers of outcomes, highlighting practical impact for multi-phenotype genomic risk prediction under privacy constraints.
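As a brief worked step (standard linear algebra rather than a quotation from the paper), the joint objective separates over the columns of $W = [w_1, \dots, w_l]$ and $Y = [y_1, \dots, y_l]$, which is why every regression can share the same $X^{T}X$. The dimensions $X \in \mathbb{R}^{n \times d}$ and $W \in \mathbb{R}^{d \times l}$, and the invertibility of $X^{T}X$, are assumptions made here for illustration:

$$
\min_{W \in \mathbb{R}^{d \times l}} \|XW - Y\|_F^2
  \;=\; \sum_{i=1}^{l} \min_{w_i \in \mathbb{R}^{d}} \|X w_i - y_i\|_2^2,
\qquad
\hat{W} = (X^{T}X)^{-1} X^{T} Y .
$$

Only the term $X^{T}Y$ depends on the outcomes, so its size grows with $l$ while $X^{T}X$ stays $d \times d$.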

Abstract

We introduce a new private regression setting we call Private Regression in Multiple Outcomes (PRIMO), inspired by the common situation where a data analyst wants to perform a set of $l$ regressions while preserving privacy, where the features $X$ are shared across all $l$ regressions, and each regression $i \in [l]$ has a different vector of outcomes $y_i$. Naively applying existing private linear regression techniques $l$ times leads to a $\sqrt{l}$ multiplicative increase in error over the standard linear regression setting. We apply a variety of techniques including sufficient statistics perturbation (SSP) and geometric projection-based methods to develop scalable algorithms that outperform this baseline across a range of parameter regimes. In particular, we obtain no dependence on $l$ in the asymptotic error when $l$ is sufficiently large. Empirically, on the task of genomic risk prediction with multiple phenotypes, we find that even for values of $l$ far smaller than the theory would predict, our projection-based method improves the accuracy relative to the variant that doesn't use the projection.
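The idea of perturbing the shared sufficient statistics once and reusing them for all $l$ regressions can be sketched as follows. This is a minimal illustration of sufficient statistics perturbation, not the paper's ReuseCovGauss or ReuseCovProj: the function name `ssp_multi_outcome`, the noise scales `sigma_cov` and `sigma_xy`, and the `ridge` term are assumptions. The noise scales are left as inputs because calibrating them to a target $(\epsilon, \delta)$ requires a sensitivity analysis that is not reproduced here, so this code on its own makes no privacy claim.

```python
import numpy as np

def ssp_multi_outcome(X, Y, sigma_cov, sigma_xy, ridge=0.0, seed=0):
    """Illustrative SSP: perturb X^T X once and reuse it for all l outcomes.

    sigma_cov, sigma_xy and ridge are hypothetical parameters; they are
    NOT calibrated to any (epsilon, delta) guarantee in this sketch.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    l = Y.shape[1]

    # Symmetrized Gaussian noise for the d x d matrix X^T X, drawn once.
    E = rng.normal(scale=sigma_cov, size=(d, d))
    E = (E + E.T) / np.sqrt(2.0)
    noisy_cov = X.T @ X + E

    # Independent Gaussian noise for the d x l cross term X^T Y.
    noisy_xy = X.T @ Y + rng.normal(scale=sigma_xy, size=(d, l))

    # One linear solve yields all l weight vectors (the columns of W).
    return np.linalg.solve(noisy_cov + ridge * np.eye(d), noisy_xy)
```

With both noise scales and `ridge` set to zero, the function reduces to ordinary least squares. A positive `ridge` is one common way to keep the noisy system well conditioned.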

Paper Structure

This paper contains 28 sections, 20 theorems, 62 equations, 6 figures, 1 table, 4 algorithms.

Key Result

Lemma 1

Let $f: \mathcal{X}^{n} \to \mathbb{R}^{d}$ be an arbitrary $d$-dimensional function, and define its sensitivity $\Delta_2(f) = \sup_{X \sim X'}\|f(X)- f(X')\|_2$, where $X \sim X'$ are datasets that differ in exactly one element. Then the Gaussian mechanism $\texttt{GaussMech}(\varepsilon, \delta, \D
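The lemma statement is cut off in this listing, so its exact noise calibration is not reproduced here. As a hedged sketch, the code below uses the classical Gaussian mechanism calibration $\sigma = \Delta_2(f)\sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is known to give $(\varepsilon, \delta)$-differential privacy for $0 < \varepsilon < 1$. The function names `gaussian_sigma` and `gauss_mech` are hypothetical, and the paper's own `GaussMech` may use different constants.

```python
import numpy as np

def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    return l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def gauss_mech(value, epsilon, delta, l2_sensitivity, rng=None):
    """Add i.i.d. Gaussian noise to every coordinate of f(X).

    Hypothetical helper: `value` is f(X), and `l2_sensitivity` is an upper
    bound on Delta_2(f) that the caller must supply.
    """
    rng = np.random.default_rng() if rng is None else rng
    value = np.asarray(value, dtype=float)
    sigma = gaussian_sigma(epsilon, delta, l2_sensitivity)
    return value + rng.normal(scale=sigma, size=value.shape)
```

The noise scale grows linearly with the sensitivity and with $1/\varepsilon$, but only with the square root of $\ln(1/\delta)$.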

Figures (6)

  • Figure 1: PRIMO imagines the scenario where the weights $W$ computed as a function of sensitive features $X$ and multiple outcomes $Y$ are published or leaked. Our algorithms prevent an adversary with access to $W$ from exposing the underlying sensitive data $x_i, y^i$.
  • Figure 2: Each $R^2$ value is averaged over $10$ iterations. The shaded area around the lines indicates the error bars for the $R^2$ value at a given value of $(l, d)$. In panels (a)-(d) we plot the average $R^2$ for $d = 25$ and $l = (1, 11, 101, 201, 401, 601, 801, 1001)$, fixing $(\epsilon, \delta) = (5,\frac{1}{n^2})$ with $n = 5008, 6042$. In panels (e)-(f) we show $l$ up to $10^5$.
  • Figure 3: Comparing the log of the ratio of the squared loss of the private estimator to the squared loss of the OLS estimator. Each value is averaged over $10$ iterations. The shaded area around the lines indicates the error bars at the given value of $l$. Panels (a) and (b) range $l$ up to $100{,}000$; panels (c) and (d) show $l = (1, 11, 101, 201, 401, 601, 801, 1001)$. Throughout, we fix $d=25$ and $(\epsilon, \delta) = (5,\frac{1}{n^2})$, with $n = 5008, 6042$ for 1000 Genomes and dbGaP respectively.
  • Figure 4: (a) shows the log of the ratio of the squared loss of the private estimator to the squared loss of the OLS estimator and (b) shows the $R^2$ values. In both panels, each value is averaged over $10$ iterations. The shaded area around the lines indicates the error bars at a given value of $l$. We plot this for $l = (1,10,25,100,500,1000,5000)$ with $d=25$, $(\epsilon, \delta) = (5,\frac{1}{n^2})$ and $n = 5 \cdot 10^{5}$ on a synthetic dbGaP dataset.
  • Figure 5: Both (a) and (b) show the log of the ratio of the squared loss of the private estimator to the squared loss of the OLS estimator. In (a) the outcomes are synthetically generated from a linear model, whereas in (b) the outcomes are generated from a 2-layer neural network, to test how the algorithms perform when the outcomes are not generated by a linear model. All values are averaged over $5$ iterations, over $l = (1, 11, 101)$ with $(\epsilon, \delta) = (5,\frac{1}{n^2})$ and $n = 5000, d = 25$ on a synthetic dataset with Gaussian features.
  • ...and 1 more figure
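The two quantities plotted in these figures can be computed as in the sketch below. The captions do not specify every detail, so several choices here are assumptions: the natural logarithm, evaluation on the same data used for fitting, squared loss summed over all $l$ outcomes, and $R^2$ averaged across outcomes. The function names are hypothetical.

```python
import numpy as np

def log_loss_ratio(X, Y, W_private):
    """log( squared loss of private estimator / squared loss of OLS )."""
    # OLS baseline for all l outcomes at once (columns of Y).
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    loss_private = np.sum((X @ W_private - Y) ** 2)
    loss_ols = np.sum((X @ W_ols - Y) ** 2)
    return np.log(loss_private / loss_ols)

def r_squared(X, Y, W):
    """Mean R^2 over the l outcome columns (averaging is an assumption)."""
    residual = np.sum((X @ W - Y) ** 2, axis=0)
    total = np.sum((Y - Y.mean(axis=0)) ** 2, axis=0)
    return np.mean(1.0 - residual / total)
```

A log ratio of zero means the private estimator matches OLS on squared loss. Larger values indicate a larger cost of privacy.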

Theorems & Definitions (34)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 1: privacybook
  • Theorem 1
  • proof
  • Lemma 2: revisit
  • proof
  • Lemma 3: revisit
  • Lemma 4: revisit
  • ...and 24 more