Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov

TL;DR

This paper addresses the challenge of achieving both strong optimization guarantees and differential privacy in federated learning with data heterogeneity. It introduces Clip21-SGD2M, a method that combines clipping, EF21-style error feedback, and double momentum to stabilize updates and control DP-noise accumulation. Theoretical results establish an optimal $O(1/T)$ convergence rate with full gradients, a near-optimal $\tilde{O}(1/\sqrt{nT})$ rate with stochastic gradients, and a near-optimal local DP-utility trade-off under DP noise. Experiments on non-convex logistic regression and neural network training show practical advantages over Clip-SGD and Clip21-SGD. The approach advances privacy-preserving FL by delivering robust optimization performance under realistic heterogeneity and privacy constraints, with potential extensions to heavy-tailed noise and adaptive optimization variants.

Abstract

Strong Differential Privacy (DP) and optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients or bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M, based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has an optimal convergence rate and a near-optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the superiority of Clip21-SGD2M over baselines in terms of optimization performance for a given DP budget.
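To make the three ingredients named in the abstract concrete, the following is a minimal NumPy sketch of how norm clipping, an EF21-style error-feedback state, and two momentum parameters might fit together. It is an illustration under assumptions, not the paper's Algorithm: the function name `clip21_sgd2m_sketch`, the parameter names `beta`, `beta_hat`, and `sigma_dp`, the zero initialization, the exact update order, and the point at which Gaussian noise is added are all hypothetical choices made for this example.

```python
import numpy as np

def clip(v, tau):
    """Norm clipping: rescale v so that its Euclidean norm is at most tau."""
    norm = np.linalg.norm(v)
    return v if norm <= tau else v * (tau / norm)

def clip21_sgd2m_sketch(grads, x0, gamma, tau, beta, beta_hat, sigma_dp, steps, seed=0):
    """Illustrative loop only (assumed structure, not the paper's exact method).

    grads: list of per-client gradient oracles, one callable per client.
    """
    rng = np.random.default_rng(seed)
    n, d = len(grads), x0.size
    x = x0.astype(float).copy()
    v = [np.zeros(d) for _ in range(n)]    # client-side momentum buffers
    g_i = [np.zeros(d) for _ in range(n)]  # client-side error-feedback states
    g = np.zeros(d)                        # server-side aggregated estimate
    for _ in range(steps):
        x = x - gamma * g                  # server step with the current estimate
        msgs = []
        for i in range(n):
            # First momentum: average the local gradients over time.
            v[i] = (1 - beta) * v[i] + beta * grads[i](x)
            # Error feedback: clip the *difference* to the local state,
            # not the gradient itself.
            c = clip(v[i] - g_i[i], tau)
            # Second momentum parameter damps the state update.
            g_i[i] = g_i[i] + beta_hat * c
            # Assumed placement of DP noise: added to the clipped message.
            msgs.append(c + sigma_dp * rng.standard_normal(d))
        g = g + beta_hat * np.mean(msgs, axis=0)
    return x

# Two clients with heterogeneous quadratics f_i(x) = 0.5 * ||x - a_i||^2.
a = [np.array([1.0, 0.0]), np.array([-3.0, 2.0])]
grads = [lambda x, ai=ai: x - ai for ai in a]
x_final = clip21_sgd2m_sketch(grads, np.zeros(2), gamma=0.05, tau=1.0,
                              beta=0.5, beta_hat=0.5, sigma_dp=0.0, steps=1000)
```

In this noiseless toy run, clipping is active only in the first iterations for the second client, and the iterate should approach the minimizer of the average objective, the mean of the `a_i`. Clipping the difference `v[i] - g_i[i]` rather than the raw gradient is the error-feedback idea: the clipped quantity shrinks as the local state catches up, so heterogeneous client gradients need not be small at the solution.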

Paper Structure

This paper contains 44 sections, 28 theorems, 202 equations, 14 figures, 3 algorithms.

Key Result

Theorem 1

Let $L, \sigma > 0$, $0 < \gamma \le 1/L$, $n=1$. There exists a convex, $L$-smooth problem, clipping parameter $\tau < 3\sigma\sqrt{3}/10$, and an unbiased stochastic gradient satisfying Assumption \ref{asmp:batch_noise} such that the method \eqref{eq:clip21_ideal} is run with a stepsize $\gamma$ and clipping param… Moreover, fix $0 < \varepsilon < L/\sqrt{2}$ and $x^0 = (0,-1)^\top$. Let the sub-Gaussian varianc…

Figures (14)

  • Figure 1: Left: behavior of stochastic Clip21-SGD and Clip21-SGD2M without DP noise (see Algorithm \ref{alg:Clip-SGDM}) initialized at $x^0 = (0, -0.07)^\top$, with stepsize $\gamma = 1/\sqrt{T}$ where $T=10^4$, i.e., starting close to the solution with a small stepsize. We observe that Clip21-SGD escapes the good neighborhood of the solution for the problem from Theorem \ref{th:clip21_non_convergence} with $n=1$, $L=2$, $\sigma=5$, and varying $\tau \in \{1, 0.1, 0.01\}$. In contrast, Clip21-SGD2M remains stable around the solution. Right: convergence of Clip21-SGD does not improve as $n$ increases for the same problem.
  • Figure 2: Comparison of tuned Clip-SGD, Clip21-SGD, and Clip21-SGD2M on logistic regression with non-convex regularization for various clipping radii $\tau$ with mini-batch (two left) and Gaussian-added (two right) stochastic gradients. The final gradient norm is averaged over the last $100$ iterations. The gradient norm dynamics are reported in \ref{fig:logreg_convergence_plots}.
  • Figure 3: Comparison of tuned Clip-SGD, Clip21-SGD, and Clip21-SGD2M on training ResNet20 (two left) and VGG16 (two right) models on the CIFAR10 dataset, where the clipping is applied globally. The training loss and test accuracy dynamics are reported in \ref{fig:vgg16_cifar10} and \ref{fig:resnet20_cifar10}.
  • Figure 4: Comparison of tuned Clip-SGD, Clip21-SGD, and Clip21-SGD2M on training ResNet20 (two left) and VGG16 (two right) models on the CIFAR10 dataset, where the clipping is applied layer-wise. The training loss and test accuracy dynamics are presented in \ref{fig:vgg16_cifar10_layerwise} and \ref{fig:resnet20_cifar10_layerwise}.
  • Figure 5: Comparison of tuned Clip-SGD, Clip21-SGD, and Clip21-SGD2M on training CNN (two left) and MLP (two right) models on the MNIST dataset for varying noise-clipping ratio, where the clipping is applied globally. The training loss and test accuracy dynamics are presented in \ref{fig:conv_plots_cnn_dp_test_acc}, \ref{fig:conv_plots_cnn_dp_train_loss}, \ref{fig:conv_plots_mlp_dp_test_acc}, and \ref{fig:conv_plots_mlp_dp_train_loss}.
  • ...and 9 more figures

Theorems & Definitions (53)

  • Definition 1: $(\varepsilon,\delta)$-Differential Privacy dwork2014algorithmic
  • Example 1: Non-Convergence of Clip-GD chen2020understanding
  • Theorem 1
  • Theorem 2: Simplified
  • Proof: Proof sketch
  • Theorem 3: Simplified
  • Proof: Proof sketch
  • Theorem 4
  • Corollary 1
  • Lemma 1: Lemma C.3 in gorbunov2019optimal
  • ...and 43 more
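
For reference, the notion named in Definition 1 is usually stated as follows in the standard literature; the paper's exact wording and its notion of adjacency may differ, and the symbols $\mathcal{M}$, $D$, $D'$, and $S$ are notation chosen for this note. A randomized mechanism $\mathcal{M}$ is $(\varepsilon,\delta)$-differentially private if, for every pair of adjacent datasets $D, D'$ and every measurable set of outputs $S$,

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta .
$$

Smaller $\varepsilon$ and $\delta$ mean the output distribution changes less when one record is altered, which is why the noise added to clipped updates must grow as the privacy budget shrinks.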