The importance of feature preprocessing for differentially private linear optimization

Ziteng Sun, Ananda Theertha Suresh, Aditya Krishna Menon

TL;DR

This work asks whether DPSGD alone suffices for good private optimization in linear classification. It constructs a counterexample in which DPSGD’s excess risk scales with the maximum feature norm $R$, which implies that private feature preprocessing is necessary for near-optimal performance. The authors introduce DPSGD-F, a variant of DPSGD augmented with private feature preprocessing, and show that its leading error term scales with the dataset diameter $\mathrm{diam}(D)$, with a matching information-theoretic lower bound. Experiments on MNIST, Fashion-MNIST, and CIFAR-100 (with pretrained features) show gains over standard DPSGD under differential privacy constraints, underscoring the value of feature preprocessing in private optimization.

Abstract

Training machine learning models with differential privacy (DP) has received increasing interest in recent years. One of the most popular algorithms for training differentially private models is differentially private stochastic gradient descent (DPSGD) and its variants, where at each step gradients are clipped and combined with some noise. Given the increasing usage of DPSGD, we ask the question: is DPSGD alone sufficient to find a good minimizer for every dataset under privacy constraints? Towards answering this question, we show that even for the simple case of linear classification, unlike non-private optimization, (private) feature preprocessing is vital for differentially private optimization. In detail, we first show theoretically that there exists an example where without feature preprocessing, DPSGD incurs an optimality gap proportional to the maximum Euclidean norm of features over all samples. We then propose an algorithm called DPSGD-F, which combines DPSGD with feature preprocessing and prove that for classification tasks, it incurs an optimality gap proportional to the diameter of the features $\max_{x, x' \in D} \|x - x'\|_2$. We finally demonstrate the practicality of our algorithm on image classification benchmarks.
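
The abstract describes DPSGD as clipping gradients and combining them with noise at each step. A minimal NumPy sketch of one such step for a linear classifier follows. The logistic loss, the function name `dpsgd_step`, and the parameters `clip_norm` and `noise_multiplier` are assumptions made for illustration; they are not taken from the paper, and the sketch does no privacy accounting.

```python
import numpy as np

def dpsgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DPSGD step for logistic loss, labels y in {-1, +1}.

    Hypothetical helper: names and loss are assumptions, not the paper's code.
    """
    margins = y * (X @ w)
    # Per-example gradient of log(1 + exp(-margin)) with respect to w.
    coeff = -y / (1.0 + np.exp(margins))
    grads = coeff[:, None] * X
    # Clip each per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(grads, axis=1)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale[:, None]
    # Add Gaussian noise scaled to the clipping norm, then average.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean

rng = np.random.default_rng(0)
X = np.array([[3.0, 4.0]])
y = np.array([1.0])
w_next = dpsgd_step(np.zeros(2), X, y, lr=1.0, clip_norm=1.0,
                    noise_multiplier=0.0, rng=rng)
```

For linear models the per-example gradient is a scalar multiple of the feature vector $x_i$, so features with large norm are the ones whose gradients get clipped most heavily. This is one way to see why feature scale and location matter for the private algorithm.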

Paper Structure

This paper contains 23 sections, 13 theorems, 68 equations, 2 figures, 2 tables, 3 algorithms.

Key Result

Lemma 1

There exists an $(\varepsilon, \delta)$-DP instance of DPSGD, whose output satisfies, where $G$ is the Lipschitz constant of the loss function, $R = \max_i \|x_i\|_2$, and $M = M(D')$ with $D' = \{(x, 1) \mid x \in D \}$.
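
To make the gap between the maximum feature norm $R = \max_i \|x_i\|_2$ and the diameter $\max_{x, x' \in D} \|x - x'\|_2$ concrete, the sketch below computes both on a small synthetic dataset whose points sit far from the origin but close to each other. The data and the helper names `max_norm` and `diameter` are hypothetical. The centering step uses the plain (non-private) mean purely for illustration; it is not the private preprocessing used by DPSGD-F.

```python
import numpy as np

def max_norm(X):
    """Largest Euclidean norm over the rows of X (the quantity R)."""
    return float(np.linalg.norm(X, axis=1).max())

def diameter(X):
    """Largest pairwise Euclidean distance between rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).max())

# Hypothetical features: tightly clustered, but far from the origin.
X = np.array([[100.0, 0.0], [101.0, 0.0], [100.0, 1.0]])

R = max_norm(X)
diam = diameter(X)

# Non-private centering, for illustration only (an assumption, not DPSGD-F).
centered = X - X.mean(axis=0)
R_centered = max_norm(centered)

print(f"R = {R:.3f}, diam = {diam:.3f}, R after centering = {R_centered:.3f}")
```

After centering, every feature norm is at most the diameter, because the mean lies in the convex hull of the data. A bound that depends on $R$ can therefore be much weaker than one that depends on the diameter when the dataset is offset from the origin.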

Figures (2)

  • Figure 1: Feature vectors.
  • Figure 2: Gradient vectors.

Theorems & Definitions (15)

  • Definition 1: Differential privacy
  • Lemma 1: song2020characterizing
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Lemma 2
  • Lemma 3: Appendix \ref{app:translate}
  • Lemma 4: Appendix \ref{app:priv_quantile}
  • Lemma 5
  • proof
  • ...and 5 more