Privacy for Free in the Overparameterized Regime

Simone Bombari, Marco Mondelli

TL;DR

The paper tackles the privacy-utility trade-off of differentially private gradient descent (DP-GD) in over-parameterized regimes, focusing on the random features (RF) model with a quadratic loss. By modeling DP-GD as an Euler-Maruyama discretization of a stochastic differential equation and combining a clipping analysis, leave-one-out techniques, and an Ornstein–Uhlenbeck process, the authors prove that privacy can be achieved for free in suitably large RF models: the excess population risk satisfies $R_P = o(1)$ even when the privacy budget satisfies $\varepsilon = o(1)$, provided $d \ll n \ll d^{3/2}$ and $p$ is large (e.g., $p \ge n^2$). The main result is a non-asymptotic bound $R_P = \tilde{O}\big( d/(n\varepsilon) + \sqrt{d/n} + \sqrt{n/d^{3/2}} \big)$, independent of $p$ up to polylog factors. The bound vanishes whenever $d \ll n \ll d^{3/2}$ and $\varepsilon \gg d/n$ (again up to polylog factors), which clarifies how over-parameterization interacts with DP in practice. Numerical experiments on MNIST and synthetic data corroborate the theory, showing that DP-GD's test performance plateaus and improves with width, with early stopping and noise acting like ridge regularization. Overall, the work provides principled guidance on hyperparameter scaling for private training in high-dimensional regimes and motivates extending the analysis beyond RF models to more general deep architectures.
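To get a feel for the scaling in the bound above, the following minimal sketch evaluates its three terms with all constants and polylog factors dropped. The function name `rp_bound_terms` and the dictionary keys are illustrative choices, not notation from the paper, and the printed values are orders of magnitude only, not predictions of the actual risk.

```python
import math


def rp_bound_terms(n, d, eps):
    """Three terms of the TL;DR bound, with constants and polylog factors dropped.

    n: number of training samples, d: input dimension, eps: privacy budget.
    The key names are hypothetical labels for the three summands.
    """
    return {
        "d_over_n_eps": d / (n * eps),
        "sqrt_d_over_n": math.sqrt(d / n),
        "sqrt_n_over_d32": math.sqrt(n / d**1.5),
    }


if __name__ == "__main__":
    # Example regime d << n << d^{3/2}: here n = d^{1.25}.
    d, n = 10**4, 10**5
    for eps in (1.0, 0.5):
        terms = rp_bound_terms(n, d, eps)
        print(eps, terms, sum(terms.values()))
```

All three terms shrink only when $n$ sits strictly between $d$ and $d^{3/2}$, and the first term additionally requires $\varepsilon$ to decay more slowly than $d/n$.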

Abstract

Differentially private gradient descent (DP-GD) is a popular algorithm to train deep learning models with provable guarantees on the privacy of the training data. In the last decade, the problem of understanding its performance cost with respect to standard GD has received remarkable attention from the research community, which formally derived upper bounds on the excess population risk $R_{P}$ in different learning settings. However, existing bounds typically degrade with over-parameterization, i.e., as the number of parameters $p$ gets larger than the number of training samples $n$ -- a regime which is ubiquitous in current deep-learning practice. As a result, the lack of theoretical insights leaves practitioners without clear guidance, leading some to reduce the effective number of trainable parameters to improve performance, while others use larger models to achieve better results through scale. In this work, we show that in the popular random features model with quadratic loss, for any sufficiently large $p$, privacy can be obtained for free, i.e., $\left|R_{P} \right| = o(1)$, not only when the privacy parameter $\varepsilon$ has constant order, but also in the strongly private setting $\varepsilon = o(1)$. This challenges the common wisdom that over-parameterization inherently hinders performance in private learning.
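As a rough illustration of the algorithm discussed in the abstract, here is a minimal NumPy sketch of full-batch DP-GD on a linear-in-features model with quadratic loss. It follows the generic clip-then-add-Gaussian-noise recipe and is not the paper's Algorithm 1: the function name `dp_gd_quadratic`, the zero initialization, the loss normalization $\tfrac12(\varphi(x)^\top\theta - y)^2$, and the noise standard deviation `sigma * c_clip / n` are all assumptions made for this example. Choosing `sigma` so that the run is $(\varepsilon, \delta)$-DP requires a separate privacy accounting step that is not shown.

```python
import numpy as np


def dp_gd_quadratic(Phi, y, eta, T, c_clip, sigma, rng):
    """Sketch of full-batch DP-GD for the loss 0.5 * (Phi[i] @ theta - y[i])**2.

    Phi: (n, p) feature matrix, y: (n,) labels, eta: learning rate,
    T: number of iterations, c_clip: clipping constant,
    sigma: noise multiplier (assumed calibration, see lead-in).
    """
    n, p = Phi.shape
    theta = np.zeros(p)  # assumption: zero initialization
    for _ in range(T):
        residual = Phi @ theta - y                  # shape (n,)
        grads = residual[:, None] * Phi             # per-sample gradients, (n, p)
        norms = np.linalg.norm(grads, axis=1)
        # Rescale each per-sample gradient so its norm is at most c_clip.
        scale = np.minimum(1.0, c_clip / np.maximum(norms, 1e-12))
        clipped_mean = (grads * scale[:, None]).mean(axis=0)
        # Gaussian noise added to the averaged clipped gradient.
        noise = rng.normal(0.0, sigma * c_clip / n, size=p)
        theta = theta - eta * (clipped_mean + noise)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((50, 200)) / np.sqrt(200)  # p > n: over-parameterized
    y = np.sign(rng.standard_normal(50))
    theta = dp_gd_quadratic(Phi, y, eta=0.5, T=100, c_clip=1.0, sigma=1.0, rng=rng)
    print(theta.shape)
```

Setting `sigma=0` and a very large `c_clip` recovers plain GD, which is a convenient sanity check when comparing the private solution $\theta^p$ against the non-private one $\theta^*$.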

Paper Structure

This paper contains 30 sections, 36 theorems, 351 equations, 4 figures, 1 algorithm.

Key Result

Theorem 1

Consider the RF model in eq:rfmodelintro with input dimension $d$ and number of features $p$. Let $n$ be the number of training samples and $\mathcal{R}_{\textup{P}}$ be defined according to eq:excess, where $\theta^*$ is the solution of GD and $\theta^p$ is the $(\varepsilon, \delta)$-differentiall

Figures (4)

  • Figure 1: Test accuracy of DP-GD on MNIST for a 2-layer, fully connected ReLU network, as a function of the number of parameters $p$ with fixed $n = 50000$. Further details on the experimental setting can be found in Section \ref{sec:experiments}.
  • Figure 2: Test accuracy of DP-GD on MNIST for a 2-layer, fully connected ReLU network, as a function of the number of training samples $n$ with fixed hidden layer width $= 1000$. Further details on the experimental setting can be found in Section \ref{sec:experiments}.
  • Figure 3: Experiments on RF models with $\tanh$ activation, and synthetic data sampled from a standard Gaussian distribution with $d = 100$. The learning task is given by $y = \textup{sign}(u^\top x)$, where $u \in \mathbb{R}^d$ is a fixed vector sampled from the unit sphere, and we consider a fixed number of training samples $n = 2000$. $\theta^p$ is the solution of Algorithm \ref{alg:dp-gd} with $\varepsilon = 4$, $\delta = 1/n$, and $\theta^*$ is the solution of GD, both with small enough learning rate $\eta$. First panel: test losses of $\theta^p$ and $\theta^*$ for different numbers of parameters $p$. Second panel: test loss for $\theta^p$ as a function of the number of training iterations $T$. Third panel: same plot as in the second panel, with the $x$-axis set to be $\eta T p / d$. Fourth panel: test loss of $\theta^p$ for a fixed $p = 40000$, as a function of the hyper-parameters $(C_{\textup{clip}}, T)$.
  • Figure 4: Experiments on a family of 2-layer fully-connected ReLU networks trained with cross-entropy loss on the MNIST classification task ($d=768$ and we fix $n = 50000$), with privacy budget $\varepsilon = 1$, $\delta = 1/n$. First panel: validation error as a function of the number of training iterations $T$. Second panel: validation error for a hidden-layer width of 1000 ($p \sim 10^6$), as a function of the hyper-parameters $(C_{\textup{clip}}, T)$.
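The synthetic setting described in the Figure 3 caption can be sketched as follows. The data distribution, the labels $y = \textup{sign}(u^\top x)$, the $\tanh$ activation, and the defaults $d = 100$, $n = 2000$ come from the caption; the function name `make_synthetic_rf_data`, the Gaussian random-feature weights `V` with $1/\sqrt{d}$ scaling, and the default `p` are assumptions made for this example, since the caption does not specify them.

```python
import numpy as np


def make_synthetic_rf_data(n=2000, d=100, p=500, seed=0):
    """Gaussian inputs, sign labels, and tanh random features (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))        # inputs from a standard Gaussian
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                 # fixed direction on the unit sphere
    y = np.sign(X @ u)                     # learning task y = sign(u^T x)
    # Assumption: i.i.d. Gaussian feature weights scaled by 1/sqrt(d).
    V = rng.standard_normal((p, d)) / np.sqrt(d)
    Phi = np.tanh(X @ V.T)                 # (n, p) random features
    return X, y, u, Phi


if __name__ == "__main__":
    X, y, u, Phi = make_synthetic_rf_data()
    print(X.shape, y.shape, Phi.shape)
```

The resulting feature matrix `Phi` and labels `y` are the inputs a DP-GD or GD training loop would consume; sweeping `p` with `n` fixed mirrors the first panel of Figure 3.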

Theorems & Definitions (62)

  • Theorem: main result (informal)
  • Definition 3.1: $(\varepsilon, \delta)$-DP (dwork2006)
  • Proposition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Theorem 1
  • Lemma 6.1
  • Lemma 6.2
  • Lemma 6.3
  • Lemma 6.4
  • ...and 52 more