Implicit Bias in Noisy-SGD: With Applications to Differentially Private Training

Tom Sander, Maxime Sylvestre, Alain Durmus

TL;DR

The paper investigates implicit bias in Noisy-SGD, positioning it as a proxy for DP-SGD to understand how gradient-noise geometry interacts with added Gaussian perturbations. Through continuous-time analyses of Linear Least Squares and Diagonal Linear Networks, it demonstrates that the intrinsic SGD bias persists and can even be amplified by additional noise, independent of clipping. Empirical results on ImageNet with NF-ResNets and DLN sparse-regression setups corroborate the theory, showing Noisy-SGD can enhance sparsity and bias strength under certain noise regimes. The findings suggest that leveraging large-batch training techniques from non-private settings could help close the privacy-utility gap in DP-SGD, informing optimization and privacy perspectives for private deep learning.

Abstract

Training Deep Neural Networks (DNNs) with small batches using Stochastic Gradient Descent (SGD) yields superior test performance compared to larger batches. The specific noise structure inherent to SGD is known to be responsible for this implicit bias. DP-SGD, used to ensure differential privacy (DP) in DNNs' training, adds Gaussian noise to the clipped gradients. Surprisingly, large-batch training still results in a significant decrease in performance, which poses an important challenge because strong DP guarantees necessitate the use of massive batches. We first show that the phenomenon extends to Noisy-SGD (DP-SGD without clipping), suggesting that the stochasticity (and not the clipping) is the cause of this implicit bias, even with additional isotropic Gaussian noise. We theoretically analyse the solutions obtained with continuous versions of Noisy-SGD for the Linear Least Square and Diagonal Linear Network settings, and reveal that the implicit bias is indeed amplified by the additional noise. Thus, the performance issues of large-batch DP-SGD training are rooted in the same underlying principles as SGD, offering hope for potential improvements in large batch training strategies.
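
To make the distinction drawn in the abstract concrete, here is a minimal NumPy sketch of the two gradient estimators. It is not the authors' implementation. It assumes a toy squared loss, and the helper names (`per_sample_grads`, `noisy_sgd_grad`, `dp_sgd_grad`) and the `clip_norm` parameter are hypothetical. The normalisation, where the noise has standard deviation $\sigma/B$ after averaging, is also an assumption and may differ from the paper's.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-sample gradients of the toy loss 0.5 * (x @ w - y)**2 (assumed loss)."""
    residuals = X @ w - y              # shape (B,)
    return residuals[:, None] * X      # shape (B, d)

def noisy_sgd_grad(grads, sigma, rng):
    """Noisy-SGD: sum the raw per-sample gradients, add Gaussian noise, average.

    After dividing by B, the noise has standard deviation sigma / B per coordinate.
    """
    B, d = grads.shape
    return (grads.sum(axis=0) + sigma * rng.standard_normal(d)) / B

def dp_sgd_grad(grads, sigma, clip_norm, rng):
    """DP-SGD: same as above, but each per-sample gradient is first clipped
    to Euclidean norm at most clip_norm, and the noise is scaled by clip_norm."""
    B, d = grads.shape
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    return (clipped.sum(axis=0) + sigma * clip_norm * rng.standard_normal(d)) / B
```

With `sigma = 0`, `noisy_sgd_grad` reduces to the plain mini-batch gradient, and `dp_sgd_grad` does the same once `clip_norm` exceeds every per-sample gradient norm. In this sketch clipping is therefore the only structural difference between the two estimators.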

Paper Structure (49 sections, 13 theorems, 96 equations, 7 figures)

This paper contains 49 sections, 13 theorems, 96 equations, 7 figures.

Key Result

theorem 1

If $\gamma \leq 1/Tr(\Bar{X}^T\Bar{X})$ then

Figures (7)

  • Figure 1: Training from scratch on ImageNet for $S=72k$ steps, using a constant learning rate, with different batch sizes $B$. The effective noise $\sigma/B$ is constant within the DP-SGD and Noisy-SGD experiments. The crosses for DP-SGD are obtained from sander2023tan. We observe a similar phenomenon for the non-clipped version (Noisy-SGD), i.e., small batches perform better than larger ones, suggesting that clipping is not solely responsible. Even when the added isotropic noise has a greater magnitude than the gradients, SGD's implicit bias persists: the natural noise structure in SGD is robust to Gaussian perturbations.
  • Figure 2: Noisy-SGD on ImageNet. We compare the norm of the mini-batch gradient to that of the Gaussian noise when training with Noisy-SGD on ImageNet, for $B=128$ and the same set-up as in Figure 1. The noise magnitude is greater than that of the gradients.
  • Figure 3: Diagonal Linear Network: Implicit Bias of GD, SGD and Noisy-SGD ($\sigma=0.5$ in Equation \ref{eq:noisy_SGD_DLN}). Shaded areas represent one standard deviation over 5 runs. (Left) Compared to GD with the same initialisation $\alpha=0.1$, SGD attains solutions closer to the sparse $\beta^*_{l_0}$, as expected from pesme2021implicit. Moreover, we observe that Noisy-SGD has a better implicit bias than SGD: the gradient noise structure is enhanced by perturbations. (Right) In absolute terms, Noisy-SGD does not even exhibit more variance than SGD (near convergence).
  • Figure 4: DLN: Distance between $\beta^*_{\alpha_\infty}$, the solution that minimizes $\phi_{\alpha_\infty}$ (obtained by GD from $\alpha_\infty$), and the one obtained by Noisy-SGD (see Proposition \ref{eq:impact_eff_init}). Shaded areas represent one standard deviation over 10 runs. For small $\sigma$, the distance is smaller than the distance between the solutions of SGD and the sparse solution $\beta^*_{l_0}$ (see Figure 3), explaining why the implicit bias persists and can be enhanced by Gaussian noise.
  • Figure 5: Diagonal Linear Network from $\alpha=0.1$: Implicit Bias of SGD and Noisy-SGD for different values of $\sigma$ in Equation \ref{eq:noisy_SGD_DLN}. Shaded areas represent one standard deviation over 5 runs, and solid lines represent the average values. Starting from this initialization, the larger $\sigma$ is, the closer the solution obtained with Noisy-SGD is to the sparse solution $\beta^*_{l_0}$.
  • ...and 2 more figures
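
The captions of Figures 1 and 2 rely on two quantities: the effective noise $\sigma/B$, held constant as $B$ varies, and the norm of the injected Gaussian noise compared with the gradient norm. The sketch below illustrates both under stated assumptions. The function names and the numeric values of `sigma_ref` and `dim` are hypothetical and are not taken from the paper; only $B=128$ appears in the captions. It uses the standard fact that the norm of a $d$-dimensional standard Gaussian vector concentrates around $\sqrt{d}$.

```python
import numpy as np

def sigma_for_batch(sigma_ref, batch_ref, batch_size):
    """Scale sigma linearly with the batch size so that sigma / B stays constant."""
    return sigma_ref * batch_size / batch_ref

def expected_noise_norm(sigma, batch_size, dim):
    """Approximate norm of (sigma / B) * z for z ~ N(0, I_dim): (sigma / B) * sqrt(dim)."""
    return sigma / batch_size * np.sqrt(dim)

def empirical_noise_norm(sigma, batch_size, dim, rng):
    """Norm of a single draw of the averaged Gaussian noise vector."""
    return np.linalg.norm(sigma / batch_size * rng.standard_normal(dim))

# Hypothetical reference configuration: sigma_ref and dim are illustrative only.
sigma_ref, batch_ref, dim = 2.0, 128, 100_000
for B in (128, 1024, 8192):
    sigma = sigma_for_batch(sigma_ref, batch_ref, B)
    # sigma / B is identical for every B, so the expected noise norm is too.
    print(B, sigma / B, expected_noise_norm(sigma, B, dim))
```

Because the noise norm grows like $\sqrt{d}$, in high dimension it can exceed the mini-batch gradient norm even for a moderate $\sigma/B$, which is the regime Figure 2 reports.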

Theorems & Definitions (28)

  • definition 1: Approximate Differential Privacy
  • theorem 1
  • proof
  • Proposition 1
  • proof
  • theorem 2
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • ...and 18 more
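
Definition 1 in the list above is approximate differential privacy. For reference, the standard $(\varepsilon, \delta)$ formulation is given here; its notation may differ from the paper's own statement. A randomized mechanism $\mathcal{M}$ is $(\varepsilon, \delta)$-DP if, for all pairs of adjacent datasets $D, D'$ and every measurable set $S$ of outputs,

$$\mathbb{P}[\mathcal{M}(D) \in S] \leq e^{\varepsilon}\, \mathbb{P}[\mathcal{M}(D') \in S] + \delta.$$

Setting $\delta = 0$ recovers pure $\varepsilon$-DP.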