Adaptive Federated Learning Over the Air

Chenhao Wang, Zihan Chen, Nikolaos Pappas, Howard H. Yang, Tony Q. S. Quek, H. Vincent Poor

TL;DR

This work proposes a federated version of adaptive gradient methods (AdaGrad and Adam) for over-the-air model training and derives their convergence rates for a broad spectrum of nonconvex loss functions, accounting for channel fading and for interference that follows a heavy-tailed distribution.

Abstract

We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training. This approach capitalizes on the inherent superposition property of wireless channels, facilitating fast and scalable parameter aggregation. Meanwhile, it enhances the robustness of the model training process by dynamically adjusting the stepsize in accordance with the global gradient update. We derive the convergence rate of the training algorithms, encompassing the effects of channel fading and interference, for a broad spectrum of nonconvex loss functions. Our analysis shows that the AdaGrad-based algorithm converges to a stationary point at the rate of $\mathcal{O}\left( \ln(T) / T^{1 - \frac{1}{\alpha}} \right)$, where $\alpha$ represents the tail index of the electromagnetic interference. This result indicates that the level of heavy-tailedness in interference distribution plays a crucial role in the training efficiency: the heavier the tail, the slower the algorithm converges. In contrast, an Adam-like algorithm converges at the $\mathcal{O}(1/T)$ rate, demonstrating its advantage in expediting the model training process. We conduct extensive experiments that corroborate our theoretical findings and affirm the practical efficacy of our proposed federated adaptive gradient methods.
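
To make the two rates in the abstract concrete, the following minimal sketch evaluates them numerically with all constants dropped. The function names are hypothetical, and the restriction $1 < \alpha \le 2$ is an assumption made here so that the exponent $1 - \frac{1}{\alpha}$ stays positive; the abstract itself does not state the admissible range.

```python
import math


def adagrad_ota_bound(T: int, alpha: float) -> float:
    """Evaluate ln(T) / T^(1 - 1/alpha) with constants dropped.

    Assumption of this sketch: 1 < alpha <= 2, so the exponent is positive.
    """
    if not (1.0 < alpha <= 2.0):
        raise ValueError("this sketch assumes 1 < alpha <= 2")
    return math.log(T) / T ** (1.0 - 1.0 / alpha)


def adam_ota_bound(T: int) -> float:
    """Evaluate 1/T with constants dropped."""
    return 1.0 / T


if __name__ == "__main__":
    T = 10_000
    # A smaller alpha means a heavier interference tail and a looser bound.
    for alpha in (1.5, 1.8, 2.0):
        print(f"alpha={alpha}: AdaGrad-type bound ~ {adagrad_ota_bound(T, alpha):.4f}")
    print(f"Adam-type bound ~ {adam_ota_bound(T):.6f}")
```

For a fixed number of rounds, the AdaGrad-type expression grows as $\alpha$ decreases, which matches the abstract's statement that a heavier tail slows convergence. The numbers are only order-of-growth illustrations, not predictions of actual training loss.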

Paper Structure (24 sections, 57 equations, 7 figures, 1 algorithm)

Figures (7)

  • Figure 1: An overview of the over-the-air edge learning system. The local gradients of each client are uploaded via analog transmissions, which automatically aggregate at the RF front end of the access point. The server filters out this radio signal to obtain a noisy global gradient, which is further processed and used to improve the global model. Steps of the model training in a typical communication round are numbered accordingly.
  • Figure 2: Performance comparison of the test accuracy and training loss of different tasks with non-i.i.d. data partition Dir = 0.1 under tail index $\alpha = 1.5$. Here (a) and (d) are for ResNet-18 on the CIFAR-10 dataset, (b) and (e) are for ResNet-34 on the CIFAR-100 dataset, and (c) and (f) are for logistic regression on the EMNIST dataset.
  • Figure 3: Performance comparison for test accuracy and training loss under tail index $\alpha = 1.8$ and scale = $0.01$, of training a ResNet-18 on the CIFAR-10 dataset.
  • Figure 4: Performance comparison for training loss with $\beta_1 = 0$ and non-i.i.d. data partition Dir = 0.1 under different $\beta_2$. We use the Adam-OTA method to train ResNet-18 on CIFAR-10.
  • Figure 5: Performance comparison for training loss with non-i.i.d. data partition Dir = $0.1$ under different $\alpha$. We use the AdaGrad-OTA method to train ResNet-18 on CIFAR-10.
  • ...and 2 more figures
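
The communication round described in the Figure 1 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: the per-client Rayleigh fading with unit mean, the symmetric $\alpha$-stable interference model (sampled with the Chambers-Mallows-Stuck method), the textbook AdaGrad update, the toy quadratic client losses, and all function names are choices made here for the example. The paper's actual update rule may differ, for instance in how the accumulator treats heavy-tailed noise.

```python
import math
import random


def sample_sas(alpha: float, scale: float, rng: random.Random) -> float:
    """One symmetric alpha-stable draw (Chambers-Mallows-Stuck method).

    Assumes 1 < alpha <= 2; alpha = 2 reduces to a Gaussian.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = max(rng.expovariate(1.0), 1e-300)  # guard against an exact zero
    a = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    b = (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    return scale * a * b


def rayleigh_unit_mean(rng: random.Random) -> float:
    """Rayleigh fading gain normalized to mean 1 (an assumption of this sketch)."""
    sigma = math.sqrt(2.0 / math.pi)
    return sigma * math.sqrt(2.0 * rng.expovariate(1.0))


def ota_aggregate(local_grads, alpha, scale, rng):
    """Superpose faded client gradients and add heavy-tailed interference."""
    n = len(local_grads)
    dim = len(local_grads[0])
    fades = [rayleigh_unit_mean(rng) for _ in range(n)]  # one gain per client
    return [
        sum(h * g[d] for h, g in zip(fades, local_grads)) / n
        + sample_sas(alpha, scale, rng)
        for d in range(dim)
    ]


def adagrad_step(w, v, g, lr=0.1, eps=1e-8):
    """Textbook AdaGrad: accumulate squared gradients, scale the step per entry."""
    v_new = [vi + gi * gi for vi, gi in zip(v, g)]
    w_new = [wi - lr * gi / (math.sqrt(vn) + eps) for wi, gi, vn in zip(w, g, v_new)]
    return w_new, v_new


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy clients: f_n(w) = 0.5 * ||w - c_n||^2, so the local gradient is w - c_n.
    centers = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0]]
    w, v = [0.0, 0.0], [0.0, 0.0]
    for _ in range(200):
        grads = [[wi - ci for wi, ci in zip(w, c)] for c in centers]
        g = ota_aggregate(grads, alpha=1.5, scale=0.01, rng=rng)
        w, v = adagrad_step(w, v, g)
    print("final model:", w)
```

The server never sees individual gradients in this sketch, only the noisy superposition, which mirrors the automatic aggregation at the RF front end mentioned in the caption. The per-entry stepsize shrinks as squared noisy gradients accumulate, which is the adaptive behavior the abstract credits for robustness.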