Private Regression via Data-Dependent Sufficient Statistic Perturbation

Cecilia Ferrando, Daniel Sheldon

TL;DR

This work introduces data-dependent SSP (DD-SSP), which privately releases marginals via the AIM mechanism and post-processes them to estimate the sufficient statistics for linear regression, yielding higher utility than AdaSSP, the traditional data-independent SSP baseline. It extends the paradigm to logistic regression via a Chebyshev polynomial approximation that expresses the objective in terms of approximate sufficient statistics derived from pairwise marginals. Empirically, DD-SSP matches or outperforms the baselines on linear regression and is competitive on logistic regression, closely tracking AIM-Synth, which trains on DP synthetic data. The authors also establish a conceptual link between DP synthetic data and data-dependent SSP, and discuss extensions to broader model classes and to mechanisms for continuous data.

Abstract

Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.
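To make the claim that sufficient statistics "can often be expressed as linear queries" concrete, the following is a minimal, non-private sketch on toy data. It assumes two discrete attributes that are one-hot encoded, and shows that the corresponding block of $X^\top X$ equals the pairwise marginal (contingency table) of those attributes. The function names, the toy records, and the attribute sizes are hypothetical and are not taken from the paper; in a data-dependent approach, the exact marginal below would be replaced by a privately released estimate.

```python
from collections import Counter
from itertools import product


def one_hot(value, size):
    """Full (non-reduced) one-hot encoding of a discrete value in range(size)."""
    return [1.0 if i == value else 0.0 for i in range(size)]


def gram_block_direct(records, sizes, a, b):
    """Block of X^T X for attributes a and b, accumulated record by record."""
    block = [[0.0] * sizes[b] for _ in range(sizes[a])]
    for rec in records:
        ua, ub = one_hot(rec[a], sizes[a]), one_hot(rec[b], sizes[b])
        for i, j in product(range(sizes[a]), range(sizes[b])):
            block[i][j] += ua[i] * ub[j]
    return block


def pairwise_marginal(records, sizes, a, b):
    """Two-way marginal: count of records for each (value of a, value of b)."""
    counts = Counter((rec[a], rec[b]) for rec in records)
    return [[float(counts[(i, j)]) for j in range(sizes[b])]
            for i in range(sizes[a])]


# Toy dataset (hypothetical): attribute 0 has 3 values, attribute 1 has 2.
records = [(0, 1), (2, 0), (0, 1), (1, 1), (2, 0)]
sizes = (3, 2)

direct = gram_block_direct(records, sizes, 0, 1)
from_marginal = pairwise_marginal(records, sizes, 0, 1)
print(direct == from_marginal)
```

Because each entry of the block is a count over records, it is a linear query on the dataset, which is why any mechanism that answers marginal queries well also yields a good estimate of these statistics by post-processing.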

Paper Structure

This paper contains 18 sections, 4 theorems, 26 equations, 3 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.1

Suppose $x = (u, w)$ where $u \in \mathbb R^{a}$ satisfies $\|u\| \leq \|\mathcal{U}\|$ and $w \in \mathbb R^b$ contains the one-hot encodings (either reduced or not reduced) of $c$ attributes. Then $\|\mathcal{X}\| := \sqrt{\|\mathcal{U}\|^2 + c}$ is an upper bound on $\|x\|$.
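The bound follows because each one-hot block has at most one nonzero entry, equal to 1, so $\|w\|^2 \leq c$ and hence $\|x\|^2 = \|u\|^2 + \|w\|^2 \leq \|\mathcal{U}\|^2 + c$. The sketch below checks this numerically on randomly generated records. The numeric-feature bound, the attribute sizes, and all function names are assumptions chosen for illustration only.

```python
import math
import random


def one_hot(value, size, reduced=False):
    """One-hot encoding; the reduced form drops the first coordinate."""
    vec = [1.0 if i == value else 0.0 for i in range(size)]
    return vec[1:] if reduced else vec


def norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))


def norm_bound(u_bound, num_attributes):
    """Upper bound sqrt(||U||^2 + c) from Proposition 3.1."""
    return math.sqrt(u_bound ** 2 + num_attributes)


rng = random.Random(0)
u_bound = 2.0        # assumed bound on the numeric part u
sizes = [3, 4, 2]    # assumed domain sizes of c = 3 categorical attributes
bound = norm_bound(u_bound, len(sizes))

max_norm = 0.0
for reduced in (False, True):
    for _ in range(1000):
        # Sample a direction for u, then rescale so that ||u|| <= u_bound.
        u = [rng.uniform(-1.0, 1.0) for _ in range(2)]
        scale = rng.uniform(0.0, u_bound) / max(norm(u), 1e-12)
        u = [scale * t for t in u]
        # Concatenate the one-hot encodings of all categorical attributes.
        w = []
        for size in sizes:
            w += one_hot(rng.randrange(size), size, reduced=reduced)
        max_norm = max(max_norm, norm(u + w))

print(max_norm <= bound + 1e-9)
```

With the reduced encoding, a record whose attribute takes the dropped value contributes an all-zero block, so the bound is looser there but still valid.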

Figures (3)

  • Figure 1: Diagram of the data-independent SSP workflow (left) versus the data-dependent linear query answering mechanism for the marginal release/synthetic data workflow (right). Quantities shown in blue are computed after privacy noise is injected and are therefore differentially private.
  • Figure 2: Degree 2 Chebyshev approximation of the logit function $\phi$, where $\phi(s) := -\log(1+e^{-s})$.
  • Figure 3: Linear regression MSE results (first row) and logistic regression AUC results (second row). Standard error bars are computed over 5 trials.
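The degree 2 Chebyshev approximation named in the Figure 2 caption can be sketched as follows. This is a minimal illustration, assuming interpolation at Chebyshev nodes on a symmetric interval $[-R, R]$; the radius $R = 6$, the number of nodes, and all function names are assumptions made here and are not taken from the paper. The point of a quadratic approximation is that a degree 2 polynomial in the linear predictor depends on the data only through first and second moments, which is what allows the objective to be written in terms of approximate sufficient statistics.

```python
import math


def phi(s):
    """phi(s) = -log(1 + exp(-s)), evaluated in a numerically stable way."""
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))


def chebyshev_coeffs(f, degree, radius, num_nodes=64):
    """Coefficients of f(radius * t) in the basis T_0..T_degree on [-1, 1]."""
    thetas = [math.pi * (j + 0.5) / num_nodes for j in range(num_nodes)]
    coeffs = []
    for k in range(degree + 1):
        # T_k(cos(theta)) = cos(k * theta) at the Chebyshev nodes.
        total = sum(f(radius * math.cos(th)) * math.cos(k * th) for th in thetas)
        coeffs.append((1.0 if k == 0 else 2.0) * total / num_nodes)
    return coeffs


def to_monomial_degree2(coeffs, radius):
    """Rewrite c0*T0 + c1*T1 + c2*T2 in t = s / radius as b0 + b1*s + b2*s^2."""
    c0, c1, c2 = coeffs
    # T0 = 1, T1 = t, T2 = 2 t^2 - 1
    return (c0 - c2, c1 / radius, 2.0 * c2 / radius ** 2)


radius = 6.0  # assumed approximation interval [-6, 6]
b0, b1, b2 = to_monomial_degree2(chebyshev_coeffs(phi, 2, radius), radius)


def phi_approx(s):
    """Quadratic surrogate for phi on [-radius, radius]."""
    return b0 + b1 * s + b2 * s * s


grid = [radius * (i / 100.0 - 1.0) for i in range(201)]
max_err = max(abs(phi(s) - phi_approx(s)) for s in grid)
print(b0, b1, b2, max_err)
```

Since $\phi(s) - \phi(-s) = s$, the odd part of $\phi$ is exactly $s/2$, so the linear coefficient comes out as $1/2$ regardless of the radius; only the constant and quadratic terms depend on the chosen interval.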

Theorems & Definitions (16)

  • Definition 2.1: Neighboring datasets
  • Definition 2.2: $L_2$ sensitivity
  • Definition 2.3: $(\epsilon, \delta)$-Differential Privacy
  • Definition 2.4: Gaussian mechanism
  • Definition 2.5: Post-processing property of DP
  • Definition 2.6: Dataset
  • Definition 2.7: Domain
  • Definition 2.8: Marginals
  • Definition 2.9: Workload
  • Proposition 3.1
  • ...and 6 more