PRIMO: Private Regression in Multiple Outcomes

Seth Neel

TL;DR

PRIMO introduces private regression in multiple outcomes, formalizing the joint objective $\min_W \|XW-Y\|_F^2$ under differential privacy. It develops two algorithmic families: ReuseCovGauss (Full DP) and ReuseCovProj (projection-based), achieving favorable privacy-utility tradeoffs across regimes and, in some cases, eliminating explicit dependence on the number of outcomes $l$ for large $l$. The methods exploit shared covariances across regressions by reusing noisy $X^{T}X$ and, in projection-based variants, privately releasing $X^{T}Y$ with improved $l$-scaling under feature or label DP. Empirical results on genomic datasets (1KG, dbGaP) show projection-based PRIMO can outperform naive baselines and remains effective for very large numbers of outcomes, highlighting practical impact for multi-phenotype genomic risk prediction under privacy constraints.

Abstract

We introduce a new private regression setting we call Private Regression in Multiple Outcomes (PRIMO), inspired by the common situation where a data analyst wants to perform a set of $l$ regressions while preserving privacy, where the features $X$ are shared across all $l$ regressions, and each regression $i \in [l]$ has a different vector of outcomes $y_i$. Naively applying existing private linear regression techniques $l$ times leads to a $\sqrt{l}$ multiplicative increase in error over the standard linear regression setting. We apply a variety of techniques including sufficient statistics perturbation (SSP) and geometric projection-based methods to develop scalable algorithms that outperform this baseline across a range of parameter regimes. In particular, we obtain no dependence on $l$ in the asymptotic error when $l$ is sufficiently large. Empirically, on the task of genomic risk prediction with multiple phenotypes, we find that even for values of $l$ far smaller than the theory would predict, our projection-based method improves the accuracy relative to the variant that doesn't use the projection.
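
The idea of reusing one noisy covariance across all $l$ regressions can be sketched as follows. This is a minimal illustration of sufficient statistics perturbation, not the paper's ReuseCovGauss or ReuseCovProj algorithms. The function name `ssp_multi_outcome`, the noise scales `sigma_cov` and `sigma_xy`, and the `ridge` term are hypothetical, and the noise scales are supplied by the caller rather than calibrated to any $(\epsilon, \delta)$ guarantee.

```python
import numpy as np


def ssp_multi_outcome(X, Y, sigma_cov, sigma_xy, ridge=1.0, seed=0):
    """Toy sufficient-statistics perturbation for l regressions sharing X.

    Assumption: sigma_cov and sigma_xy are chosen by the caller. This sketch
    does not derive them from a privacy budget or a sensitivity bound.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    l = Y.shape[1]

    # Perturb X^T X once with symmetric Gaussian noise; the same noisy
    # covariance is then reused for every one of the l outcomes.
    E = rng.normal(scale=sigma_cov, size=(d, d))
    E = np.triu(E) + np.triu(E, 1).T
    noisy_cov = X.T @ X + E

    # Perturb the d x l cross term X^T Y with independent Gaussian noise.
    noisy_xy = X.T @ Y + rng.normal(scale=sigma_xy, size=(d, l))

    # Solve all l regressions in one call; the ridge term is a hypothetical
    # regularizer that keeps the noisy system well conditioned.
    return np.linalg.solve(noisy_cov + ridge * np.eye(d), noisy_xy)
```

With both noise scales and `ridge` set to zero, the function reduces to ordinary least squares for $\min_W \|XW-Y\|_F^2$, which gives a simple sanity check.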

Paper Structure (28 sections, 20 theorems, 62 equations, 6 figures, 1 table, 4 algorithms)

Introduction
Preliminaries
Private Linear Regression
Results
Full DP: The ReuseCovGauss Algorithm
Improved Algorithms for Large $l$
PRIMO Under Feature Differential Privacy
PRIMO Under Label Differential Privacy
Computational Efficiency
Experiments
Limitations
Appendix
Additional Related Work
Query Release.
Projection mechanisms and Algorithm \ref{alg:proj}.
...and 13 more sections

Key Result

Lemma 1

Let $f: \mathcal{X}^{n} \to \mathbb{R}^{d}$ be an arbitrary $d$-dimensional function, and define its sensitivity $\Delta_2(f) = \sup_{X \sim X'}\|f(X)- f(X')\|_2$, where $X \sim X'$ are datasets that differ in exactly one element. Then the Gaussian mechanism $\texttt{GaussMech}(\varepsilon, \delta, \D
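
To illustrate the Gaussian mechanism named in Lemma 1, here is a minimal sketch. The lemma's statement is cut off above, so the noise calibration used here is an assumption: it is the classical textbook choice $\sigma = \Delta_2(f)\sqrt{2\ln(1.25/\delta)}/\varepsilon$, which is usually stated for $\varepsilon < 1$ and may differ from the calibration the paper uses. The names `gauss_sigma` and `gauss_mech` are hypothetical.

```python
import numpy as np


def gauss_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale (assumed calibration)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps


def gauss_mech(value, sensitivity, eps, delta, rng=None):
    """Add i.i.d. Gaussian noise to f(X), scaled to its L2 sensitivity."""
    rng = np.random.default_rng() if rng is None else rng
    value = np.asarray(value, dtype=float)
    sigma = gauss_sigma(sensitivity, eps, delta)
    return value + rng.normal(scale=sigma, size=value.shape)
```

The caller must supply a valid bound on $\Delta_2(f)$; the sketch does not check or enforce it.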

Figures (6)

Figure 1: PRIMO imagines the scenario where the weights $W$ computed as a function of sensitive features $X$ and multiple outcomes $Y$ are published or leaked. Our algorithms prevent an adversary with access to $W$ from exposing the underlying sensitive data $x_i, y^i$.
Figure 2: Each $R^2$ value is averaged over $10$ iterations. The shaded area around the lines indicates the error bars for the $R^2$ value at a given value of $(l, d)$. In Figures (a)-(d) we plot the average $R^2$ for $d = 25$ and $l = (1, 11, 101, 201, 401, 601, 801, 1001)$, fixing $(\epsilon, \delta) = (5,\frac{1}{n^2})$ with $n = 5008, 6042$ for 1000 Genomes and dbGaP respectively. In Figures (e)-(f) we show $l$ up to $10^5$.
Figure 3: Comparing the log of the ratio of the squared loss of the private estimator to the squared loss of the OLS estimator. Each value is averaged over $10$ iterations. The shaded area around the lines indicates the error bars at the given value of $l$. (a) and (b) range $l$ up to $100,000$; (c) and (d) show $l = (1, 11, 101, 201, 401, 601, 801, 1001)$, with $d=25$ and $(\epsilon, \delta) = (5,\frac{1}{n^2})$ fixed and $n = 5008, 6042$ for 1000 Genomes and dbGaP respectively.
Figure 4: (a) shows the log of the ratio of the squared loss of the private estimator to the squared loss of the OLS estimator and (b) shows the $R^2$ values. For both of these, the value was averaged over $10$ iterations. The shaded area around the lines indicates the error bars at a given value of $l$. We plot this for $l = (1,10,25,100,500,1000,5000)$ with $d=25$, $(\epsilon, \delta) = (5,\frac{1}{n^2})$ and $n = 5 \times 10^{5}$ on a synthetic dbGaP dataset.
Figure 5: Both (a) and (b) show the log of the ratio of the squared loss of the private estimator to the square loss of the OLS estimator, but in (a) the outcomes are synthetically generated from a linear model, whereas in (b) the outcomes are generated from a 2-layer neural network, to test how the algorithms perform when the outcomes are not generated by a linear model. All values are averaged over $5$ iterations, over $l = (1, 11, 101)$ with $(\epsilon, \delta) = (5,\frac{1}{n^2})$ and $n = 5000, d = 25$ on a synthetic dataset with Gaussian features.
...and 1 more figure
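
The two quantities plotted in these figures, the log ratio of squared losses and $R^2$, can be computed along the following lines. This is a sketch under assumptions the captions do not settle: the natural logarithm, losses measured on the same data used to fit, and a single $R^2$ pooled over all $l$ outcomes. The function names `log_loss_ratio` and `r_squared` are hypothetical.

```python
import numpy as np


def log_loss_ratio(X, Y, W_private):
    """Log of (squared loss of W_private) / (squared loss of the OLS fit)."""
    # Non-private OLS reference solution for min_W ||XW - Y||_F^2.
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    loss_private = np.sum((X @ W_private - Y) ** 2)
    loss_ols = np.sum((X @ W_ols - Y) ** 2)
    return np.log(loss_private / loss_ols)


def r_squared(X, Y, W):
    """Pooled R^2 over all outcomes: 1 - residual SS / total SS."""
    residual_ss = np.sum((X @ W - Y) ** 2)
    total_ss = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - residual_ss / total_ss
```

Because OLS minimizes the squared loss, the log ratio is nonnegative for any estimator evaluated this way, and values near zero mean the private estimator is close to the non-private fit.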

Theorems & Definitions (34)

Definition 1
Definition 2
Definition 3
Lemma 1: privacybook
Theorem 1
proof
Lemma 2: revisit
proof
Lemma 3: revisit
Lemma 4: revisit
...and 24 more
