Private Regression via Data-Dependent Sufficient Statistic Perturbation
Cecilia Ferrando, Daniel Sheldon
TL;DR
This work introduces data-dependent SSP (DD-SSP), which privately releases marginals via the AIM mechanism to estimate sufficient statistics for linear regression, yielding higher utility than AdaSSP, the traditional data-independent SSP method. It extends the paradigm to logistic regression via a Chebyshev polynomial approximation that expresses the objective in terms of approximate sufficient statistics derived from pairwise marginals. Empirical results show DD-SSP matches or outperforms baselines in linear regression and offers competitive performance for logistic regression, closely tracking AIM-Synth, which trains on DP synthetic data. The authors also establish a conceptual link between DP synthetic data and data-dependent SSP, and discuss extensions to broader model classes and continuous data mechanisms.
Abstract
Sufficient statistic perturbation (SSP) is a widely used method for differentially private linear regression. SSP adopts a data-independent approach where privacy noise from a simple distribution is added to sufficient statistics. However, sufficient statistics can often be expressed as linear queries and better approximated by data-dependent mechanisms. In this paper we introduce data-dependent SSP for linear regression based on post-processing privately released marginals, and find that it outperforms state-of-the-art data-independent SSP. We extend this result to logistic regression by developing an approximate objective that can be expressed in terms of sufficient statistics, resulting in a novel and highly competitive SSP approach for logistic regression. We also make a connection to synthetic data for machine learning: for models with sufficient statistics, training on synthetic data corresponds to data-dependent SSP, with the overall utility determined by how well the mechanism answers these linear queries.
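The claim that sufficient statistics can be expressed as linear queries can be illustrated with a minimal sketch. It assumes discrete features taking values in {0, ..., k-1} and shows that the second-moment matrix used by linear regression is recoverable from pairwise marginals alone. All function names, the `noise_scale` parameter, and the `ridge` parameter are hypothetical choices for illustration. The sketch uses exact marginals where a private mechanism such as AIM would release noisy ones, and the Gaussian noise in the SSP step is not calibrated to any privacy budget, so this is neither the authors' implementation nor AdaSSP.

```python
import numpy as np

def pairwise_marginal(col_a, col_b, k):
    """Count table for two integer columns with values in {0, ..., k-1}."""
    table = np.zeros((k, k))
    np.add.at(table, (col_a, col_b), 1.0)
    return table

def second_moments_from_marginals(data, k):
    """Rebuild data.T @ data using only pairwise marginals (linear queries)."""
    d = data.shape[1]
    values = np.arange(k, dtype=float)
    moments = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            m = pairwise_marginal(data[:, i], data[:, j], k)
            # sum over (a, b) of a * b * count(a, b)
            moments[i, j] = values @ m @ values
    return moments

def ssp_linear_regression(xtx, xty, noise_scale, rng, ridge=0.0):
    """Data-independent SSP sketch: perturb the statistics, then solve.

    noise_scale is a placeholder (assumption); a real mechanism would
    derive it from the sensitivity of the statistics and the privacy budget.
    """
    d = xtx.shape[0]
    e = rng.normal(scale=noise_scale, size=(d, d))
    e = np.triu(e) + np.triu(e, 1).T  # keep the noisy matrix symmetric
    noisy_xtx = xtx + e + ridge * np.eye(d)
    noisy_xty = xty + rng.normal(scale=noise_scale, size=d)
    return np.linalg.solve(noisy_xtx, noisy_xty)

# Toy data: three discrete features plus a discrete target in the last column.
rng = np.random.default_rng(0)
k = 3
data = rng.integers(0, k, size=(200, 4))

moments = second_moments_from_marginals(data, k)
xtx, xty = moments[:3, :3], moments[:3, 3]

# With zero noise the estimate coincides with ordinary least squares.
theta_exact = ssp_linear_regression(xtx, xty, noise_scale=0.0, rng=rng)
# With noise added, the estimate is a perturbed version of the same solution.
theta_noisy = ssp_linear_regression(xtx, xty, noise_scale=1.0, rng=rng)
```

In a data-dependent variant, the exact tables returned by `pairwise_marginal` would be replaced by privately released marginals, and the regression step would be pure post-processing of those releases.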
