DC-SGD: Differentially Private SGD with Dynamic Clipping through Gradient Norm Distribution Estimation

Chengkun Wei, Weixian Li, Chen Gong, Wenzhi Chen

TL;DR

DC-SGD tackles the challenge of choosing the clipping threshold $C$ in DP-SGD by privately estimating the gradient-norm distribution with histograms and dynamically updating $C$ through one of two mechanisms: DC-SGD-P (percentile-based) and DC-SGD-E (minimum expected squared error). The framework preserves DP by publishing noisy gradient-norm histograms and reusing a split-noise privacy accounting scheme, and it comes with formal privacy and convergence guarantees. Empirically, DC-SGD reduces hyperparameter-tuning overhead by up to $9\times$ and delivers notable accuracy gains (e.g., up to $10.62\%$ on CIFAR10 under the same privacy budget) while remaining compatible with Adam. Overall, DC-SGD offers a practical, efficient approach to private deep learning with a reduced tuning burden and robust privacy-utility guarantees.

Abstract

Differentially Private Stochastic Gradient Descent (DP-SGD) is a widely adopted technique for privacy-preserving deep learning. A critical challenge in DP-SGD is selecting the clipping threshold C, which must balance clipping bias against noise magnitude; tuning it incurs substantial privacy and computational overhead. In this paper, we propose Dynamic Clipping DP-SGD (DC-SGD), a framework that leverages differentially private histograms to estimate gradient norm distributions and dynamically adjust the clipping threshold C. Our framework includes two novel mechanisms: DC-SGD-P and DC-SGD-E. DC-SGD-P adjusts the clipping threshold based on a percentile of gradient norms, while DC-SGD-E chooses C to minimize the expected squared error of gradients. These dynamic adjustments significantly reduce the burden of tuning C. Extensive experiments on various deep learning tasks, including image classification and natural language processing, show that our dynamic algorithms speed up hyperparameter tuning by up to 9 times compared with DP-SGD. DC-SGD-E also improves accuracy on CIFAR10 by up to 10.62% over DP-SGD under the same privacy budget for hyperparameter tuning. We conduct rigorous theoretical privacy and convergence analyses, showing that our methods integrate seamlessly with the Adam optimizer. Our results highlight the robust performance and efficiency of DC-SGD, offering a practical solution for differentially private deep learning with reduced computational overhead and enhanced privacy guarantees.
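To make the trade-off behind the clipping threshold concrete, the following is a minimal sketch of one standard DP-SGD aggregation step, not the paper's implementation. The names `clip_to_norm`, `dp_sgd_noisy_mean`, and `noise_multiplier` are hypothetical, and the noise standard deviation is assumed to be `noise_multiplier * clip_c`, as in the usual Gaussian-mechanism formulation. A small C distorts more gradients (clipping bias), while a large C scales up the injected noise.

```python
import math
import random


def clip_to_norm(grad, clip_c):
    """Scale grad so that its L2 norm is at most clip_c."""
    norm = math.sqrt(sum(x * x for x in grad))
    scale = min(1.0, clip_c / norm) if norm > 0 else 1.0
    return [x * scale for x in grad]


def dp_sgd_noisy_mean(per_sample_grads, clip_c, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average.

    Assumption: noise std = noise_multiplier * clip_c, so the noise
    grows linearly with the clipping threshold.
    """
    batch = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        for i, x in enumerate(clip_to_norm(grad, clip_c)):
            total[i] += x
    std = noise_multiplier * clip_c
    return [(total[i] + rng.gauss(0.0, std)) / batch for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
    print(dp_sgd_noisy_mean(grads, clip_c=1.0, noise_multiplier=1.0, rng=rng))
```

With `clip_c=1.0`, only the first gradient is rescaled; the second already lies inside the clipping ball and passes through unchanged.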

Paper Structure

This paper contains 32 sections, 31 equations, 11 figures, 8 tables, 4 algorithms.

Figures (11)

  • Figure 1: Classification accuracy of ResNet18 trained on the CIFAR10 dataset with varying clipping thresholds $C$ and learning rates $\eta$, using $\epsilon=8$ and the SGD optimizer within the DP-SGD framework.
  • Figure 2: $C_t$ for different percentiles $p$ on a set of 256 synthetic samples drawn from $\mathcal{N}(100,20^2)$, under different histogram constructions and noise levels. Each subfigure uses a different histogram bin count $b$ and generates its data separately.
  • Figure 3: $E_{t,C_t}$ for different $C_t$ on a set of 256 synthetic samples drawn from $\mathcal{N}(100,20^2)$, under different histogram structures and noise levels. The variance term is computed with $\sigma_T=1$, $B=256$, $d=100000$. All histograms have $R=120$. Each subfigure uses a different histogram bin count $b$ and generates its data separately.
  • Figure 4: $E_{t,C_t}$ for different $C_t$ at some iteration for CIFAR10 on ResNet18, SVHN on ResNet34, and MNIST on a CNN. Privacy budget $(\epsilon=8,\delta=1/|D|)$, batch size $B=256$, clipping threshold $C=1$, using the default Adam optimizer.
  • Figure 5: Accuracy of DC-SGD-P for different $p$ on SVHN and QNLI.
  • ...and 6 more figures
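
The synthetic setup described for Figure 2 can be approximated with the sketch below, which picks a threshold at percentile $p$ from a noisy histogram of norms. This is an illustration under stated assumptions rather than the authors' algorithm: the function names, the equal-width binning on $[0, \text{upper}]$, the Gaussian noise on bin counts, and the clamping of negative counts are all hypothetical choices, and `upper=120` is borrowed from the $R=120$ in Figure 3 on the assumption that $R$ denotes the histogram range. The noise scale is not calibrated to any privacy budget here.

```python
import random


def noisy_histogram(norms, num_bins, upper, noise_std, rng):
    """Count norms into equal-width bins on [0, upper], then add Gaussian noise."""
    width = upper / num_bins
    counts = [0.0] * num_bins
    for n in norms:
        # Norms above `upper` fall into the last bin.
        idx = max(0, min(int(n / width), num_bins - 1))
        counts[idx] += 1.0
    return [c + rng.gauss(0.0, noise_std) for c in counts]


def percentile_threshold(noisy_counts, upper, p):
    """Right edge of the first bin where cumulative noisy mass reaches fraction p."""
    width = upper / len(noisy_counts)
    counts = [max(c, 0.0) for c in noisy_counts]  # clamp negative noisy counts
    total = sum(counts)
    if total <= 0:
        return upper
    running = 0.0
    for i, c in enumerate(counts):
        running += c
        if running >= p * total:
            return (i + 1) * width
    return upper


if __name__ == "__main__":
    rng = random.Random(0)
    # 256 synthetic "gradient norms" drawn from N(100, 20^2), as in Figure 2.
    norms = [abs(rng.gauss(100.0, 20.0)) for _ in range(256)]
    hist = noisy_histogram(norms, num_bins=20, upper=120.0, noise_std=1.0, rng=rng)
    for p in (0.25, 0.5, 0.75):
        print(p, percentile_threshold(hist, upper=120.0, p=p))
```

Because the threshold is computed only from the noisy counts, it is a post-processing of the released histogram; fewer bins make each count less sensitive to noise but coarsen the set of thresholds that can be returned.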

Theorems & Definitions (1)

  • proof