Guaranteeing Both Consensus and Optimality in Decentralized Nonconvex Optimization with Multiple Local Updates

Jie Liu, Zuang Wang, Yongqiang Wang

TL;DR

This work tackles decentralized nonconvex optimization with multiple local updates and no central server. It introduces MILE, a novel algorithm that achieves both consensus and optimality by recasting the update dynamics as a periodic system and applying lifting techniques to derive a closed-form state evolution. The authors prove an $O(1/T)$ convergence rate under exact and stochastic gradients and demonstrate communication and memory efficiency by requiring only a single exchanged vector and storing two variables per agent. Empirical results on MNIST and CIFAR-10 corroborate the theoretical guarantees, showing faster convergence than state-of-the-art multi-update decentralized methods and highlighting MILE's practical value for large-scale, heterogeneous networks.

Abstract

Scalable decentralized optimization in large-scale systems hinges on efficient communication. A common way to reduce communication overhead is to perform multiple local updates between two communication rounds, as in federated learning. However, extending this strategy to fully decentralized settings poses fundamental challenges. Existing decentralized algorithms with multiple local updates guarantee accurate convergence only under strong convexity, limiting applicability to the nonconvex problems prevalent in machine learning. Moreover, many methods require exchanging and storing auxiliary variables, such as gradient-tracking vectors or correction terms, to ensure convergence under data heterogeneity, incurring high communication and memory costs. In this paper, we propose MILE, a fully decentralized algorithm that guarantees both consensus and optimality under multiple local updates in general nonconvex settings. This is achieved through a novel periodic-system-based formulation and a lifting-based analysis, which together yield a closed-form expression for the state evolution across local updates, a theoretical advance not achieved previously. This closed-form characterization allows us to establish, for the first time, guaranteed consensus and optimality in decentralized nonconvex optimization under multiple local updates, in contrast to prior results that only ensure optimality of the average state. We prove that MILE achieves an $O(1/T)$ convergence rate under both exact and stochastic gradients, while requiring only a single variable exchange per interacting agent pair, minimizing communication and memory costs. Numerical experiments on benchmark datasets confirm its effectiveness.
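The communication pattern described above (several local gradient steps, then one exchange with neighbors) can be sketched on a toy problem. This is a generic local-update decentralized gradient descent baseline, not MILE, whose update rule is not given in this summary. The function name `local_update_dgd`, the quadratic local objectives $f_i(x)=\tfrac{1}{2}(x-b_i)^2$, the ring mixing matrix `W`, and all parameter values are assumptions chosen for illustration.

```python
# Toy baseline (NOT the MILE algorithm): each agent i minimizes
# f_i(x) = 0.5 * (x - b_i)^2, so the global minimizer is mean(b).
# Agents take `tau` local gradient steps, then exchange ONE scalar
# with their neighbors through a doubly stochastic mixing matrix W.

def local_update_dgd(b, W, alpha=0.1, tau=5, rounds=200):
    n = len(b)
    x = [0.0] * n          # one state per agent
    comms = 0              # number of communication rounds used
    for _ in range(rounds):
        # tau local gradient steps, no communication
        for _ in range(tau):
            x = [x[i] - alpha * (x[i] - b[i]) for i in range(n)]
        # one communication round: weighted average with neighbors
        x = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        comms += 1
    return x, comms

# Heterogeneous local targets (hypothetical data) on a 4-agent ring.
b = [0.0, 1.0, 2.0, 3.0]
W = [
    [0.50, 0.25, 0.00, 0.25],
    [0.25, 0.50, 0.25, 0.00],
    [0.00, 0.25, 0.50, 0.25],
    [0.25, 0.00, 0.25, 0.50],
]

x, comms = local_update_dgd(b, W)
mean_x = sum(x) / len(x)
spread = max(x) - min(x)   # disagreement between agents
print("states:", x)
print("average:", mean_x, "spread:", spread, "rounds:", comms)
```

In this toy run the average of the agents' states settles at the global minimizer `mean(b)`, but the individual states remain apart. This loosely illustrates the gap between average-state optimality and consensus that the abstract refers to, under the stated assumptions only.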

Paper Structure

This paper contains 24 sections, 8 theorems, 136 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

For a scalar sequence $\{a(t)\}^{\infty}_{t=0}$ satisfying the recursive relation where $\frac{\tau-1}{\tau+3}<\rho<1$, and $k\in\mathbb{Z}$, we have the following results for $1\leq p\leq \tau$ (with $a(0)\in\mathbb{R}$ and $a(1)\in\mathbb{R}$ being initial values of the sequence) where $\cos(\theta)=\frac{\rho(\tau+1)+1-\tau}{2\sqrt{\rho}}$, $\sin(\theta)=\frac{\sqrt{4\rho-[\rho(\tau+1)+(1-\t
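As a quick sanity check on the quantities quoted above, the following sketch evaluates the stated expression for $\cos(\theta)$ over the stated range $\frac{\tau-1}{\tau+3}<\rho<1$ and confirms that it stays strictly inside $(-1,1)$, so that $\theta$ is a well-defined real angle. The helper names `cos_theta`, `rho_lower_bound`, and `cosine_in_range` are hypothetical. The expression for $\sin(\theta)$ is truncated in this summary and is therefore not used.

```python
import math

def cos_theta(rho, tau):
    # Expression for cos(theta) as quoted in Lemma 1.
    return (rho * (tau + 1) + 1 - tau) / (2 * math.sqrt(rho))

def rho_lower_bound(tau):
    # Lower end of the admissible range for rho as quoted in Lemma 1.
    return (tau - 1) / (tau + 3)

def cosine_in_range(tau, samples=1000):
    # Sample rho strictly inside (lower bound, 1) and check |cos(theta)| < 1.
    lo = rho_lower_bound(tau)
    for k in range(1, samples):
        rho = lo + (1 - lo) * k / samples
        if not -1 < cos_theta(rho, tau) < 1:
            return False
    return True

# Check a range of local-update counts tau, including tau = 10 from the figures.
results = {tau: cosine_in_range(tau) for tau in range(1, 51)}
print(all(results.values()))

# Example angle for tau = 10 and an illustrative rho = 0.9.
theta = math.acos(cos_theta(0.9, 10))
print(theta)
```

This is a numerical check on sampled values, not a proof. It only confirms that the quoted range of $\rho$ is consistent with the quoted cosine expression.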

Figures (2)

  • Figure 1: Comparison of training loss under a common stepsize ($\alpha=0.12$ in subplot (a) and $\alpha=0.04$ in subplot (b)) between MILE and DIGing, K-GT, and LED. Subplot (a) shows the results on the MNIST dataset, while subplot (b) presents the results on the CIFAR-10 dataset. The number of local updates was set to $\tau=10$. Each curve represents the average of three independent runs.
  • Figure 2: Comparison of training loss under the best-found stepsize for each algorithm. Subplot (a) shows the results on the MNIST dataset, while subplot (b) presents the results on the CIFAR-10 dataset. The number of local updates was set to $\tau=10$. Each curve represents the average of three independent runs.

Theorems & Definitions (18)

  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Remark 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • ...and 8 more