Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

Zhijie Chen, Qiaobo Li, Arindam Banerjee

TL;DR

Federated learning suffers from high communication costs when deep models are used. The authors introduce Sketched Adaptive Federated Learning (SAFL), which combines unbiased gradient sketching (e.g., CountSketch, SRHT, Gaussian) with adaptive optimization (AMSGrad) to reduce per-round communication to $O(b)$ with a sketch size $b=O(\log d)$. They prove high-probability convergence at rate $O(1/\sqrt{T})$ under mild noise assumptions, leveraging the intrinsic Hessian dimension $\mathcal{I} = \sum_i |\lambda_i| / \max_i |\lambda_i|$ to achieve dimension-independent rates up to logarithmic factors; near initialization they also show $O(1/T)$ improvements, and SACFL (SAFL with clipping) extends the analysis to non-i.i.d. settings. Empirical results on vision and language benchmarks corroborate the theory, showing SAFL competitive with full-dimension adaptive methods and with error-feedback-based approaches. Overall, the work provides a practical, theory-backed path to scalable communication-efficient federated learning for large deep networks.
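
To make the intrinsic dimension $\mathcal{I} = \sum_i |\lambda_i| / \max_i |\lambda_i|$ concrete, the following minimal sketch evaluates it on two synthetic spectra. The power-law spectrum and the helper names `intrinsic_dimension` and `power_law_spectrum` are assumptions made for illustration; they are not data or code from the paper.

```python
def intrinsic_dimension(eigenvalues):
    """Return sum_i |lambda_i| / max_i |lambda_i| for a non-empty spectrum."""
    magnitudes = [abs(lam) for lam in eigenvalues]
    largest = max(magnitudes)
    if largest == 0.0:
        raise ValueError("spectrum is identically zero")
    return sum(magnitudes) / largest


def power_law_spectrum(d, decay=2.0):
    """Hypothetical spectrum lambda_i = i^(-decay), i = 1..d (illustrative only)."""
    return [i ** (-decay) for i in range(1, d + 1)]


d = 1000
flat = [1.0] * d                    # isotropic curvature: every direction matters equally
decaying = power_law_spectrum(d)    # anisotropic curvature: a few directions dominate

print("flat spectrum:    ", intrinsic_dimension(flat))      # equals d
print("decaying spectrum:", intrinsic_dimension(decaying))  # stays below 2
```

With a flat spectrum the intrinsic dimension equals the ambient dimension, while with a fast-decaying spectrum it stays bounded as `d` grows, which is the regime in which the summary's dimension-independent rates are meaningful.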

Abstract

Combining gradient compression methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and less per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., number of parameters, which is prohibitively high for modern deep learning models. In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise existent in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently-popularized anisotropic curvatures in deep learning losses, e.g., fast decaying loss Hessian eigen-values. In the i.i.d. client setting of FL, we show that SAFL achieves asymptotic $O(1/\sqrt{T})$ convergence, and converges faster in the initial epochs. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback.
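
The combination described in the abstract can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not the authors' algorithm: clients compress their updates with a CountSketch that shares one seed per round, the server averages in sketch space, desketches, and applies one AMSGrad step. All function names, the single local step, and the hyperparameters are hypothetical.

```python
import random


def make_count_sketch(d, b, seed):
    """Draw the shared bucket and sign pattern for a d -> b CountSketch."""
    rng = random.Random(seed)
    buckets = [rng.randrange(b) for _ in range(d)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
    return buckets, signs


def sketch(vec, buckets, signs, b):
    """Compress: add each signed coordinate into its bucket (a linear map)."""
    out = [0.0] * b
    for value, bucket, sign in zip(vec, buckets, signs):
        out[bucket] += sign * value
    return out


def desketch(compressed, buckets, signs):
    """Decompress: read each coordinate back from its bucket with its sign."""
    return [sign * compressed[bucket] for bucket, sign in zip(buckets, signs)]


def amsgrad_step(x, g, m, v, v_hat, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One in-place AMSGrad update of x along the (desketched) direction g."""
    for i, gi in enumerate(g):
        m[i] = beta1 * m[i] + (1 - beta1) * gi
        v[i] = beta2 * v[i] + (1 - beta2) * gi * gi
        v_hat[i] = max(v_hat[i], v[i])   # AMSGrad keeps the running maximum
        x[i] -= lr * m[i] / (v_hat[i] ** 0.5 + eps)


def safl_style_round(x, client_updates, state, b, seed):
    """One hypothetical round: sketch on clients, average and step on server."""
    buckets, signs = make_count_sketch(len(x), b, seed)  # shared by all parties
    sketches = [sketch(u, buckets, signs, b) for u in client_updates]
    avg = [sum(col) / len(sketches) for col in zip(*sketches)]
    amsgrad_step(x, desketch(avg, buckets, signs), *state)
    return avg  # each client uploaded b floats instead of d


d, b = 64, 16
rng = random.Random(0)
x = [0.0] * d
state = ([0.0] * d, [0.0] * d, [0.0] * d)  # first moment, second moment, max
client_updates = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
uplink = safl_style_round(x, client_updates, state, b, seed=1)
print(len(uplink), "floats sent per client instead of", d)
```

Because the sketch is linear, averaging the client sketches gives the same result as sketching the averaged update, so the server never needs the full-dimensional vectors. Desketching is unbiased in expectation over the random signs, but each coordinate carries collision noise, which is the entry-wise sketching noise in the first moments and preconditioners that the abstract refers to.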

Paper Structure

This paper contains 22 sections, 20 theorems, 79 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.2

Suppose the sequence of iterates $\{x_t\}_{t=1}^T$ is generated by the SAFL algorithm with a constant learning rate $\eta_t \equiv \eta$. Under Assumptions 1-4, for any $T$ and $\epsilon > 0$, with probability $1 - \Theta(\delta) - O(\exp(-\Omega(\nu^2))) -\delta_g$, where $\delta, \delta_g$, and $\nu$ are the randomness of sketching, sub-Gaussian noise, and martingales respect

Figures (6)

  • Figure 1: Model performance on CIFAR-10 with ResNet of 42M parameters. The plot starts from the 10th epoch for better demonstration; Third: Validation error on SAFL with different sketch sizes. The legend 4e7 represents training in the ambient dimension without sketching. Fourth: Training error on SAFL with different sketch sizes. Larger sketch size improves the convergence rate and the peak validation error is achieved when $b=4e4$.
  • Figure 2: Validation Error on CIFAR-10. We finetune a ViT-base model (with 86M parameters) from the pretrained backbone checkpoint (Dosovitskiy et al., 2020). 1Bit-Adam has comparable compression rates with $b=8e5$. SAFL optimizer consistently outperforms in all sketch sizes.
  • Figure 3: Validation Error on SST2 (GLUE) with BERT of 100M parameters. Left: sketch size $b=2e5$; Middle: $b=2e6$; Right: ADA_OPT is Adam, with sketch size $b \in \{2e4, 2e5, 2e6\}$. The legend $1e8$ represents training in the ambient dimension without sketching. Larger sketch sizes mainly improve the convergence rate and achieve comparable test errors at the end of training.
  • Figure 4: The power-law structure of the Hessian spectrum on LeNet. Reproduced from Fig. 1 of Xie et al. (2022).
  • Figure 5: Eigenspectrum density every 5 epochs. The model is ViT-Small, trained on CIFAR-10. Most eigenvalues concentrate near 0, and the density decays very rapidly as the absolute value of the eigenvalue grows, indicating a summable eigenspectrum.
  • ...and 1 more figure
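
The sketch sizes quoted in the captions above translate into per-round compression ratios $d/b$. The short calculation below is only a back-of-the-envelope illustration: it takes the ambient dimensions from the legends and captions (4e7 for the ResNet, 86M for ViT-base, 1e8 for BERT) and counts floats per upload, ignoring any encoding overhead. The names `SETTINGS` and `compression_ratio` are hypothetical.

```python
# (setting, ambient dimension d, sketch sizes b) as read from the figure captions.
SETTINGS = [
    ("ResNet / CIFAR-10", 40_000_000, [40_000]),
    ("ViT-base / CIFAR-10", 86_000_000, [800_000]),
    ("BERT / SST2", 100_000_000, [20_000, 200_000, 2_000_000]),
]


def compression_ratio(d, b):
    """Floats uploaded without sketching divided by floats uploaded with a size-b sketch."""
    return d / b


for name, d, sketch_sizes in SETTINGS:
    for b in sketch_sizes:
        ratio = compression_ratio(d, b)
        print(f"{name}: d={d:.0e}, b={b:.0e}, ratio={ratio:g}x")
```

Under these assumptions the ResNet run at $b=4e4$ uploads roughly a thousandth of the full update per round, and the BERT runs span ratios from 50 to 5000 depending on the sketch size.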

Theorems & Definitions (29)

  • Remark 3.1
  • Definition 3.1
  • Remark 3.2
  • Theorem 3.2
  • Remark 3.3
  • Corollary 1
  • Corollary 2
  • Lemma 3.3
  • Lemma 3.4
  • Remark 3.4
  • ...and 19 more