Faster Adaptive Decentralized Learning Algorithms

Feihu Huang, Jianyu Zhao

TL;DR

This work tackles decentralized nonconvex optimization on networks by introducing AdaMDOS and AdaMDOF, two adaptive momentum-based algorithms that integrate gradient tracking with a unified adaptive learning-rate matrix. The authors establish a solid convergence framework and prove near-optimal sample complexities: $\tilde{O}(\epsilon^{-3})$ for stochastic problems and $O(\sqrt{n}\epsilon^{-2})$ for finite-sum problems. Empirical results on logistic regression, CNNs, and ResNets across ring and expander networks corroborate the theoretical gains, showing faster convergence and better efficiency than existing adaptive decentralized methods. The proposed approach offers a scalable, privacy-preserving alternative for distributed nonconvex learning with strong theoretical guarantees and practical performance.
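The gradient-tracking ingredient mentioned above can be illustrated with a minimal sketch. This is plain deterministic gradient tracking on a toy problem, not AdaMDOS or AdaMDOF: it omits the momentum and the adaptive learning-rate matrix. The helper names (`ring_weights`, `gradient_tracking`), the 1/3 ring weights, the quadratic local losses, and the step size are all assumptions chosen for illustration.

```python
import numpy as np

def ring_weights(m):
    """Doubly stochastic mixing matrix for a ring of m >= 3 agents: each agent
    averages itself and its two neighbours with weight 1/3 (assumed choice)."""
    eye = np.eye(m)
    return (eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)) / 3.0

def gradient_tracking(b, W, gamma=0.1, steps=300):
    """Plain gradient tracking for toy local losses f_i(x) = 0.5 * (x - b_i)**2."""
    x = np.zeros_like(b)      # one scalar iterate per agent
    grad = x - b              # local gradients at the current iterates
    y = grad.copy()           # tracker, initialised to the local gradients
    for _ in range(steps):
        x_new = W @ x - gamma * y        # mix with neighbours, then descend
        grad_new = x_new - b
        y = W @ y + grad_new - grad      # track the network-average gradient
        x, grad = x_new, grad_new
    return x

b = np.arange(1.0, 9.0)       # eight agents with local targets 1..8
x_final = gradient_tracking(b, ring_weights(8))
print(x_final)                # every entry should be close to b.mean()
```

Each agent only exchanges values with its two ring neighbours, yet all iterates approach the minimizer of the average loss, because the tracker `y` follows the network-wide average gradient rather than the local one.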

Abstract

Decentralized learning has recently received increasing attention in machine learning due to its advantages in implementation simplicity, system robustness, and data privacy. Meanwhile, adaptive gradient methods show superior performance in many machine learning tasks, such as training neural networks. Although some works study decentralized optimization algorithms with adaptive learning rates, these adaptive decentralized algorithms still suffer from high sample complexity. To fill this gap, we propose a class of faster adaptive decentralized algorithms (i.e., AdaMDOS and AdaMDOF) for distributed nonconvex stochastic and finite-sum optimization, respectively. Moreover, we provide a solid convergence analysis framework for our methods. In particular, we prove that AdaMDOS obtains a near-optimal sample complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary solution of nonconvex stochastic optimization. Meanwhile, AdaMDOF obtains a near-optimal sample complexity of $O(\sqrt{n}\epsilon^{-2})$ for finding an $\epsilon$-stationary solution of nonconvex finite-sum optimization, where $n$ denotes the sample size. To the best of our knowledge, AdaMDOF is the first adaptive decentralized algorithm for nonconvex finite-sum optimization. Experimental results demonstrate the efficiency of our algorithms.
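To get a feel for how the two stated rates scale, the following sketch evaluates the leading terms $\epsilon^{-3}$ and $\sqrt{n}\epsilon^{-2}$ directly. Constants and logarithmic factors hidden by the $\tilde{O}$ and $O$ notation are ignored, so the numbers are orders of magnitude only, and the function names are hypothetical.

```python
import math

def stochastic_rate(eps):
    """Leading term eps^-3 of the stochastic bound (constants and logs dropped)."""
    return eps ** -3

def finite_sum_rate(n, eps):
    """Leading term sqrt(n) * eps^-2 of the finite-sum bound (constants dropped)."""
    return math.sqrt(n) * eps ** -2

eps = 0.01
n = 10_000
print(stochastic_rate(eps))      # order-of-magnitude sample count, stochastic case
print(finite_sum_rate(n, eps))   # order-of-magnitude sample count, finite-sum case

# Halving the target accuracy multiplies the leading terms by 8 and 4 respectively.
print(stochastic_rate(eps / 2) / stochastic_rate(eps))
print(finite_sum_rate(n, eps / 2) / finite_sum_rate(n, eps))
```

Under these leading terms, the finite-sum rate needs fewer samples than the stochastic rate whenever $\sqrt{n} < \epsilon^{-1}$, that is, when the dataset is small relative to the target accuracy.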

Paper Structure

This paper contains 22 sections, 13 theorems, 127 equations, 7 figures, 1 table, and 2 algorithms.

Key Result

Theorem 5.1

Suppose the sequence $\{\{x^i_t\}_{i=1}^m\}_{t=1}^T$ is generated by Algorithm 1. Under Assumptions 1-5, let $\eta_t=\eta$, $0<\beta_t\leq1$ for all $t\geq 0$, $\gamma\leq \min(\frac{\rho(1-\nu^2)}{48\theta_t},\frac{3\rho(1-\nu^2)\theta_t}{58L^2})$, $\eta\leq \min(\frac{$ … where $G= \frac{F(\bar{x}_1)-F^*}{\rho\gamma\eta}+( \frac{4\nu^2}{\rho^2(1-\nu)} +\frac{9}{2\rho^2$ …

Figures (7)

  • Figure 1: Stationary gap vs. epoch on the w8a dataset under the ring network (Left) and the 3-regular network (Right).
  • Figure 2: Stationary gap vs. epoch on the covertype dataset under the ring network (Left) and the 3-regular network (Right).
  • Figure 3: Training a CNN on the MNIST dataset under the ring network: training loss vs. epoch (Left), training accuracy (%) vs. epoch (Middle), and test accuracy (%) vs. epoch (Right).
  • Figure 4: Training a CNN on the MNIST dataset under the 3-regular network: training loss vs. epoch (Left), training accuracy (%) vs. epoch (Middle), and test accuracy (%) vs. epoch (Right).
  • Figure 5: Training ResNet-18 on the Tiny-ImageNet dataset: test accuracy (%) vs. epoch under the ring network (Left) and the 3-regular network (Right).
  • ...and 2 more figures
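
The two topologies named in the captions differ in how quickly information mixes across agents. The sketch below builds a uniform-weight mixing matrix for a ring and for one possible 3-regular graph (a ring plus antipodal chords) and compares the second-largest eigenvalue magnitude of each. The graph construction, the uniform weights, and the helper names are assumptions for illustration; the paper's exact networks and its definition of $\nu$ are not shown in this excerpt.

```python
import numpy as np

def circulant_weights(m, offsets):
    """Uniform-weight mixing matrix for a circulant graph on m nodes.
    Node i is linked to (i + k) % m for every k in offsets (assumed layout)."""
    A = np.zeros((m, m))
    for i in range(m):
        for k in offsets:
            A[i, (i + k) % m] = 1.0
    assert np.allclose(A, A.T), "offsets must describe an undirected graph"
    degree = A.sum(axis=1)[0]            # every node has the same degree
    return (A + np.eye(m)) / (degree + 1.0)

def second_largest_magnitude(W):
    """Second-largest eigenvalue magnitude of a symmetric mixing matrix."""
    magnitudes = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return magnitudes[1]

m = 8
W_ring = circulant_weights(m, offsets=(1, -1))              # 2 neighbours each
W_regular = circulant_weights(m, offsets=(1, -1, m // 2))   # 3 neighbours each

nu_ring = second_largest_magnitude(W_ring)
nu_regular = second_largest_magnitude(W_regular)
print(nu_ring, nu_regular)   # a smaller value indicates faster mixing
```

In this toy construction the 3-regular graph yields the smaller value, which matches the usual intuition that better-connected networks reach consensus in fewer communication rounds.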

Theorems & Definitions (25)

  • Theorem 5.1
  • Remark 5.2
  • Remark 5.4
  • Theorem 5.5
  • Remark 5.6
  • Lemma 1.1
  • Lemma 1.2
  • Lemma 1.3
  • proof
  • Lemma 1.4
  • ...and 15 more