On the Convergence of DP-SGD with Adaptive Clipping

Egor Shulgin, Peter Richtárik

TL;DR

The paper addresses the sensitivity of DP-SGD to the clipping threshold by analyzing SGD with Quantile Clipping (QC-SGD), which clips gradients using a threshold $\tau(x)$ set to the $p$-quantile of the stochastic gradient norm. It provides a convergence analysis under $L$-smoothness and heavy-tailed noise, revealing a clipping-induced bias analogous to that of fixed-threshold clipping and showing that a time-varying schedule for both the quantile and the stepsize can eliminate this bias and ensure convergence to a stationary point. The authors extend the result to the differentially private setting (DP-QC-SGD), deriving convergence guarantees that incorporate the effective noise scale $\mathfrak{S}=1/B+\sigma_{\mathrm{DP}}^2$, and discuss the privacy-utility trade-off and the role of mini-batching. Overall, the work establishes a theoretical foundation for adaptive clipping techniques, provides practical guidelines for parameter selection, and outlines future directions for robust and private adaptive clipping.

Abstract

Stochastic Gradient Descent (SGD) with gradient clipping is a powerful technique for enabling differentially private optimization. Although prior works extensively investigated clipping with a constant threshold, private training remains highly sensitive to threshold selection, which can be expensive or even infeasible to tune. This sensitivity motivates the development of adaptive approaches, such as quantile clipping, which have demonstrated empirical success but lack a solid theoretical understanding. This paper provides the first comprehensive convergence analysis of SGD with quantile clipping (QC-SGD). We demonstrate that QC-SGD suffers from a bias problem similar to constant-threshold clipped SGD but show how this can be mitigated through a carefully designed quantile and step size schedule. Our analysis reveals crucial relationships between quantile selection, step size, and convergence behavior, providing practical guidelines for parameter selection. We extend these results to differentially private optimization, establishing the first theoretical guarantees for DP-QC-SGD. Our findings provide theoretical foundations for a widely used adaptive clipping heuristic and highlight open avenues for future research.
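To make the quantile-clipping idea concrete, here is a minimal NumPy sketch of a single update. It is an illustration under stated assumptions, not the paper's algorithm: the function name `qc_sgd_step` is hypothetical, the threshold is taken as the empirical $p$-quantile of the per-sample gradient norms in the current batch (the paper defines $\tau(x)$ through the distribution of the gradient norm), and no privacy noise is added. A differentially private variant would also need the threshold itself to be estimated privately.

```python
import numpy as np


def qc_sgd_step(x, per_sample_grads, p, step_size):
    """One illustrative quantile-clipped SGD step (hypothetical helper).

    x:                current iterate, shape (d,)
    per_sample_grads: stochastic gradients at x, shape (n, d)
    p:                quantile level in [0, 1]
    step_size:        stepsize for this iteration
    """
    norms = np.linalg.norm(per_sample_grads, axis=1)
    # Assumption: use the empirical p-quantile of the batch norms as threshold.
    tau = np.quantile(norms, p)
    # Standard clipping factor: leave small gradients alone, rescale large ones
    # so that their norm equals tau.
    alpha = np.minimum(1.0, tau / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * alpha[:, None]
    x_new = x - step_size * clipped.mean(axis=0)
    return x_new, tau
```

A time-varying schedule of the kind the abstract refers to would pass a different `p` and `step_size` at each iteration; the specific schedule is given in the paper and is not reproduced here.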

Paper Structure (17 sections, 5 theorems, 46 equations, 1 figure)

This paper contains 17 sections, 5 theorems, 46 equations, 1 figure.

Key Result

Lemma 1

Assume that the stochastic gradient estimator $\nabla f_\xi(x)$ satisfies Assumption ass:b_variance, $\alpha_{\xi}(x)$ is chosen as in eq:alpha, and the $p$-th quantile clipping threshold $\tau(x)$ satisfies eq:quantile. Then for all $x \in \mathbb{R}^d$, where $\overline{\alpha}(x) \coloneqq \mathbb{E}\left[\alpha_{\xi}(x)\right]$.
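The quantities named in the lemma can be explored numerically. The sketch below estimates the mean clipping factor $\overline{\alpha}(x)$ from a sample of gradient norms. It rests on assumptions not confirmed by this page: the clipping factor is taken to be the standard $\alpha_\xi(x) = \min\{1, \tau(x)/\|\nabla f_\xi(x)\|\}$ (eq:alpha is not shown here), the threshold is the empirical $p$-quantile, the norms are synthetic heavy-tailed draws, and `mean_clipping_factor` is a hypothetical helper. It does not reproduce the lemma's bound.

```python
import numpy as np


def mean_clipping_factor(grad_norms, p):
    """Empirical estimate of E[alpha] for a p-quantile threshold (illustrative)."""
    tau = np.quantile(grad_norms, p)
    # Assumed clipping factor: 1 for norms below tau, tau / norm otherwise.
    alpha = np.minimum(1.0, tau / np.maximum(grad_norms, 1e-12))
    return alpha.mean(), tau


rng = np.random.default_rng(0)
# Synthetic heavy-tailed gradient norms (Pareto tail, shifted to start at 1).
norms = rng.pareto(2.5, size=10_000) + 1.0

for p in (0.1, 0.5, 0.9):
    a_bar, tau = mean_clipping_factor(norms, p)
    print(f"p={p:.1f}  tau={tau:.3f}  mean alpha={a_bar:.3f}")
```

Under these assumptions, roughly a fraction $p$ of the samples falls at or below the threshold and is left unclipped, so the estimated mean clipping factor should come out no smaller than $p$ and grow toward 1 as $p$ increases.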

Figures (1)

  • Figure 1: Evolution of the adaptive clipping norm at five different quantiles (0.1, 0.3, 0.5, 0.7, 0.9) on six federated learning problems without Differential Privacy noise. Note that each task has a unique shape (e.g., increasing and decreasing) to its update norm evolution, which further motivates an adaptive approach. The figure is taken from the paper by andrew2021differentially.

Theorems & Definitions (7)

  • Lemma 1: merad2024robust
  • Lemma 2
  • Theorem 1: General case
  • Corollary 1: Constant parameters
  • Example 1
  • Theorem 2: DP-QC-SGD
  • proof