SoK: A Review of Differentially Private Linear Models For High-Dimensional Data

Amol Khanna, Edward Raff, Nathan Inkawhich

TL;DR

The paper addresses the challenge of training differentially private (DP) linear models in high-dimensional settings with fewer samples than features ($n<d$), where overfitting and privacy leakage are prominent. It surveys and categorizes optimization methods (Model Selection, Frank–Wolfe, Compressed Learning, ADMM, Thresholding, Coordinate Descent, Mirror Descent) and provides a systematic empirical comparison across six datasets for linear and logistic regression under various privacy budgets, with code released for reproducibility. A key finding is that methods that account for per-feature scale and use robust or coordinate-wise updates often outperform Lipschitz-based approaches, but computational cost and ambiguous regularization effects remain major hurdles. The work offers practical guidance for future DP high-dimensional modeling and establishes a benchmark framework for evaluating new methods.

Abstract

Linear models are ubiquitous in data science, but are particularly prone to overfitting and data memorization in high dimensions. To guarantee the privacy of training data, differential privacy can be used. Many papers have proposed optimization techniques for high-dimensional differentially private linear models, but a systematic comparison between these methods does not exist. We close this gap by providing a comprehensive review of optimization methods for private high-dimensional linear models. Empirical tests on all methods demonstrate robust and coordinate-optimized algorithms perform best, which can inform future research. Code for implementing all methods is released online.

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: A taxonomy of optimization techniques used for high-dimensional DP linear models.
  • Figure 2: Bodyfat: Mean Squared Error
  • Figure 3: PAH: Mean Squared Error
  • Figure 4: E2006: Mean Squared Error
  • Figure 5: Heart: Accuracy
  • ...and 2 more figures