Better Rates for Private Linear Regression in the Proportional Regime via Aggressive Clipping

Simone Bombari, Inbar Seroussi, Marco Mondelli

TL;DR

This work provides sharper rates for DP stochastic gradient descent (DP-SGD) by crucially operating in a regime where clipping happens frequently, and demonstrates the optimality of aggressive clipping.

Abstract

Differentially private (DP) linear regression has received significant attention in the recent theoretical literature, with several works aimed at obtaining improved error rates. A common approach is to set the clipping constant much larger than the expected norm of the per-sample gradients. While this simplifies the analysis, it is in sharp contrast with what empirical evidence suggests is needed to optimize performance. Our work bridges this gap between theory and practice: we provide sharper rates for DP stochastic gradient descent (DP-SGD) by crucially operating in a regime where clipping happens frequently. Specifically, we consider the setting where the data is multivariate Gaussian, the number of training samples $n$ is proportional to the input dimension $d$, and the algorithm guarantees constant-order zero concentrated DP. Our method relies on establishing a deterministic equivalent for the trajectory of DP-SGD in terms of a family of ordinary differential equations (ODEs). As a consequence, the risk of DP-SGD is bounded between the solutions of two ODEs, with upper and lower bounds matching for isotropic data. By studying these ODEs when $n / d$ is large enough, we demonstrate the optimality of aggressive clipping, and we uncover the benefits of decaying learning rate and private noise scheduling.
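To make the clipping regime described in the abstract concrete, the following is a minimal sketch of one-pass clipped, noisy SGD for linear regression on Gaussian data. It is not the paper's Algorithm: the function names, the constant learning rate, the noise scale `sigma * c`, and all numerical values are assumptions chosen for illustration, and the noise is not calibrated to any privacy budget. The sketch also reports the fraction of steps where clipping was active, which is close to one when the clipping constant is small relative to typical per-sample gradient norms.

```python
import numpy as np


def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)


def dp_sgd_linear_regression(X, y, c, lr, sigma, rng):
    """One pass of clipped, noisy SGD on the squared loss (illustrative only).

    c     : clipping constant (hypothetical value chosen by the caller)
    lr    : constant learning rate (the paper studies schedules; not modeled here)
    sigma : noise multiplier; the added noise has standard deviation sigma * c
    """
    n, d = X.shape
    theta = np.zeros(d)
    n_clipped = 0
    for i in range(n):
        residual = X[i] @ theta - y[i]
        grad = residual * X[i]  # per-sample gradient of 0.5 * residual**2
        n_clipped += int(np.linalg.norm(grad) > c)
        noise = sigma * c * rng.standard_normal(d)
        theta = theta - lr * (clip(grad, c) + noise)
    return theta, n_clipped / n


# Usage on synthetic isotropic Gaussian data (all values are assumptions).
rng = np.random.default_rng(0)
n, d, label_noise = 2000, 20, 0.3
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)
X = rng.standard_normal((n, d))
y = X @ theta_star + label_noise * rng.standard_normal(n)

theta_hat, clipped_fraction = dp_sgd_linear_regression(
    X, y, c=0.1, lr=0.05, sigma=0.0, rng=rng
)
print(f"fraction of clipped steps: {clipped_fraction:.2f}")
```

With a small `c`, nearly every update has norm exactly `c` before noise is added, so the iterate moves by at most `lr * c` per step in a direction set by the sign of the residual. This is the frequent-clipping behavior that a large clipping constant would hide.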

Paper Structure

This paper contains 36 sections, 18 theorems, 205 equations, 4 figures, 1 algorithm.

Key Result

Proposition 4.1

Algorithm \ref{alg:dp-sgd} satisfies $(\rho^2 / 2)$-zCDP, where

Figures (4)

  • Figure 1: Numerical simulations for Algorithm \ref{alg:dp-sgd} ($d = 10, 100, 1000$) and the ODEs in \ref{eq:upperlowerODEsbody}. We consider two schedules of the form in \ref{eq:polynomialsched}: output perturbation ($\alpha = 0$, first and second panel) and DP-SGD with constant noise ($\alpha = 1/2$, third and fourth panel). We fix $\gamma = 0.1$, $\rho = 1$, $\zeta = 0.3$, $\tilde{\eta}(0) = 3$, $c = 1$, and consider both isotropic data ($\kappa = 1$, first and third panel) and data covariance with condition number $\kappa = 2$ (second and fourth panel). For $\alpha = 0$, we also report for $t \geq 1$ the risk $\mathcal{R}(\theta^p)$, and the values of $\overline R(1) + 2 c^2 \tilde{\eta}^2(1) \gamma^2 / \rho^2$ and $\underline R(1) + 2 c^2 \tilde{\eta}^2(1) \gamma^2 / \rho^2$ with a red continuous and dashed line respectively. $\theta^*$ is sampled uniformly on the unit sphere and the spectrum of $\Sigma$ follows a power law with appropriate exponent to achieve the specified value of $\kappa$. For each value of $d$, we report bands corresponding to 1 standard deviation around the mean over 10 independent trials of Algorithm \ref{alg:dp-sgd}. In the first and third panel, we have $\overline R(t) = \underline R(t)$ as the ODEs in \ref{eq:upperlowerODEsbody} match.
  • Figure 2: Numerical simulations for $\mathcal{R}(\theta^p)$ obtained via Algorithm \ref{alg:dp-sgd} for $d = 1000$ and $\kappa = 1$, as a function of $c$ and $\tilde{\eta}(0)$. We consider the schedules in \ref{eq:polynomialsched} corresponding to output perturbation ($\alpha = 0$, first and second panel) and DP-SGD with constant noise ($\alpha = 1/2$, third and fourth panel), with fixed $\rho = 1$ and $\zeta = 0.3$. We set $\gamma = 0.1$ in the first and third panel, and $\gamma = 0.01$ in the second and fourth panel. The values of $\mathcal{R}(\theta^p)$ are capped at $1$, and $\theta^*$ is chosen such that $\mathcal{R}(\theta_0) = 0.5$. We indicate with red dashed lines the curves $c = 1$, $\tilde{\eta}(0) = 2 / \gamma$, and $c \tilde{\eta}(0) = \ln(1 / \gamma)$, and we display the average over 10 independent trials.
  • Figure 3: Numerical simulations for $\mathcal{R}(\theta^p_\alpha)$ obtained via Algorithm \ref{alg:dp-sgd} for $d = 1000$, $\kappa = 1$ and $\rho = 1$, as a function of $n$. We consider the schedules in \ref{eq:polynomialsched} for $\alpha \in \{0, 0.5, 1, 2\}$, we optimize w.r.t. $c$ and $\tilde{\eta}(0)$, and we report the average over 10 independent trials, as well as the confidence interval corresponding to 1 standard deviation.
  • Figure 4: The functions $\mu_c(R)$, $\mu_c(R) / c'$, $\nu_c(R)$, $\nu_c(R) / c'$, plotted as a function of $c' = c / \sqrt{2R + \zeta^2}$.

Theorems & Definitions (33)

  • Definition 3.1: Zero concentrated DP \cite{bun2016concentrated}
  • Proposition 4.1
  • Definition 4.2: Homogenized DP-SGD
  • Theorem 1
  • Proposition 4.3
  • Proposition 4.4
  • Theorem 2
  • Theorem 3
  • Definition A.1: $(\varepsilon, \delta)$-DP \cite{dwork2006}
  • Definition A.2: Rényi DP \cite{Mironov2017}
  • ...and 23 more
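
For readers less familiar with the privacy notion used in Definition 3.1 and Proposition 4.1, the standard facts about zero concentrated DP can be summarized as follows. This is a sketch based on the general definition of Bun and Steinke, not a restatement of the paper's own equations. The symbols $\rho_0$ (a generic zCDP parameter), $\lambda$ (the Rényi order) and $\delta$ are introduced here for illustration and are assumptions, chosen to avoid clashing with the paper's $\rho$ and $\alpha$.

```latex
% zCDP: a mechanism M is \rho_0-zCDP if, for all neighboring datasets D, D'
% and all Renyi orders \lambda > 1,
D_\lambda\big( M(D) \,\|\, M(D') \big) \le \rho_0 \, \lambda .

% Standard conversion: \rho_0-zCDP implies (\varepsilon, \delta)-DP for every \delta > 0 with
\varepsilon = \rho_0 + 2 \sqrt{\rho_0 \ln(1/\delta)} .

% Substituting \rho_0 = \rho^2 / 2, the parameter appearing in Proposition 4.1:
\varepsilon = \frac{\rho^2}{2} + \rho \sqrt{2 \ln(1/\delta)} .

% Example: \rho = 1 and \delta = 10^{-5} give
\varepsilon = \tfrac{1}{2} + \sqrt{2 \ln(10^{5})} \approx 5.30 .
```

Under this conversion, the value $\rho = 1$ used in the figures corresponds to a constant-order $\varepsilon$, in line with the "constant-order zero concentrated DP" setting described in the abstract.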