Problem-Parameter-Free Decentralized Nonconvex Stochastic Optimization

Jiaxiang Li, Xuxing Chen, Shiqian Ma, Mingyi Hong

TL;DR

This paper addresses decentralized nonconvex stochastic optimization without relying on problem parameters like Lipschitz constants or network spectrum. It introduces D-NASA, a parameter-free algorithm that uses normalized gradient directions and moving-average tracking to control consensus error and enable convergence without prior problem information. Theoretical results show that D-NASA achieves optimal nonconvex stochastic convergence rates and linear speedup in the number of nodes, matching lower bounds under standard assumptions. Empirical evaluations on synthetic and real datasets demonstrate robust performance and superior generalization without hyperparameter tuning across diverse network topologies, underscoring practical impact for large-scale distributed learning. Overall, the work closes a key gap between theory and practice in decentralized optimization by delivering a parameter-free, scalable, and provably efficient algorithm.

Abstract

Existing decentralized algorithms usually require knowledge of problem parameters to update local iterates. For example, choosing hyperparameters such as the learning rate typically requires knowing the Lipschitz constant of the global gradient or topological information about the communication network, neither of which is usually accessible in practice. In this paper, we propose D-NASA, the first algorithm for decentralized nonconvex stochastic optimization that requires no prior knowledge of any problem parameters. We show that D-NASA has the optimal rate of convergence for nonconvex objectives under very mild conditions and enjoys the linear-speedup effect, i.e., the computation becomes faster as the number of nodes in the system increases. Extensive numerical experiments are conducted to support our findings.
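
To make the ingredients named above concrete (normalized directions, a moving average of stochastic gradients, and tracking of the network-wide average), here is a minimal Python sketch on a toy least-squares problem. It is an illustration under assumptions, not the paper's D-NASA pseudocode: the update order, the helper names `ring_mixing_matrix` and `run_sketch`, the roles assigned to `alpha` (averaging weight) and `eta` (step length), the ring weights of 1/3, and the toy objective are all hypothetical choices made for this example.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic matrix for a ring (assumed weights of 1/3)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def run_sketch(b, W, T, alpha, eta, sigma=0.1, seed=0):
    """Toy problem: minimise (1/n) * sum_i 0.5 * ||x - b_i||^2 over n nodes.

    Row i of every (n, d) array belongs to node i. Multiplying by W is one
    round of communication with neighbours.
    """
    rng = np.random.default_rng(seed)
    n, d = b.shape
    x = np.zeros((n, d))  # local iterates

    def noisy_grad(points):
        # Local stochastic gradients: exact gradient plus Gaussian noise.
        return (points - b) + sigma * rng.standard_normal((n, d))

    z = noisy_grad(x)  # moving average of local stochastic gradients
    y = z.copy()       # tracker of the network-wide average of z
    for _ in range(T):
        # Normalised direction: the step length is eta regardless of the
        # gradient scale, so no smoothness constant enters the update.
        direction = y / (np.linalg.norm(y, axis=1, keepdims=True) + 1e-12)
        x = W @ (x - eta * direction)
        z_new = (1.0 - alpha) * z + alpha * noisy_grad(x)
        # Gradient-tracking correction keeps mean(y) equal to mean(z).
        y = W @ (y + z_new - z)
        z = z_new
    return x, y, z
```

Because `W` is doubly stochastic and `y` starts equal to `z`, the column-wise mean of `y` stays equal to that of `z` at every iteration, which is the usual gradient-tracking invariant. The values passed for `alpha` and `eta` are free inputs here; nothing in the sketch reads a Lipschitz constant or a spectral quantity of `W`.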

Paper Structure

This paper contains 14 sections, 16 theorems, 86 equations, 7 figures, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Suppose Assumptions assump_l_smooth, assump_bdd_variance and assump_l_continuous hold, and take $\eta_t=\eta T^{-1/2}$ (constant) or $\eta_t=\eta t^{-1/2}$ (diminishing, with $\eta_0=0$ in this case) for some $\eta>0$. Then the update of Algorithm algo_decen_grad_tracking satisfies: Here $\Tilde{\rho}>0$ is a parameter dependent on $\rho$ in eq_rho, $\Delta_0=f(\Bar{x}^0)-f^*$ is the initial function value ga
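
The two stepsize choices in the theorem can be written out as short Python helpers. This is a sketch for illustration only: the function names are hypothetical, and indexing iterations as $t=0,\dots,T-1$ is an assumption, since the excerpt does not state the index range.

```python
import math

def constant_stepsizes(eta, T):
    """eta_t = eta * T**(-1/2) for every iteration t = 0, ..., T-1."""
    return [eta / math.sqrt(T)] * T

def diminishing_stepsizes(eta, T):
    """eta_0 = 0 and eta_t = eta * t**(-1/2) for t = 1, ..., T-1."""
    return [0.0] + [eta / math.sqrt(t) for t in range(1, T)]
```

Both schedules depend only on the free scale `eta` and the iteration count; neither takes a smoothness constant or a network quantity as input.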

Figures (7)

  • Figure 1: Convergence curves of Algorithm \ref{algo_decen_normalized_averaged_grad_tracking} for solving \ref{eq_synthetic_ls} with different numbers of nodes/devices $n\in\{5, 10, 20\}$. The top two panels correspond to fixed stepsizes ($\alpha_t=\sqrt{n/T}$, $\eta_t=n^{1/4}/T^{3/4}$) and the bottom two to diminishing stepsizes ($\alpha_t=\sqrt{n/t}$, $\eta_t=n^{1/4}/t^{3/4}$).
  • Figure 2: Convergence curves for D-SGD, D-SGD, D-ASAGT and D-NASA for solving \ref{eq_synthetic_ls} under the spike model.
  • Figure 3: The testing accuracy of the outputs from different algorithms with respect to different choices of learning rates.
  • Figure 4: Network topology for $n=8$. The four graphs represent the ring, (an instance of) the random, the ladder and the complete graph.
  • Figure 5: The testing accuracy of the outputs from different algorithms with respect to different choices of learning rates for the a9a dataset. The four panels correspond to the four network graphs in Figure 4.
  • ...and 2 more figures
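
For the topologies named in the Figure 4 caption, the Python sketch below builds mixing matrices for the ring and the complete graph and computes a spectral quantity that parameter-dependent methods commonly need. Several things here are assumptions: the weights (1/3 on the ring, uniform $1/n$ on the complete graph), the helper names, and the definition $\rho=\|W-\mathbf{1}\mathbf{1}^\top/n\|_2$, which is a common convention and not necessarily the paper's eq_rho.

```python
import numpy as np

def ring_weights(n):
    """Ring graph: each node averages itself and its two neighbours (assumed weights)."""
    W = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i, j % n] += 1.0 / 3.0
    return W

def complete_weights(n):
    """Complete graph: every node averages over all nodes uniformly (assumed weights)."""
    return np.full((n, n), 1.0 / n)

def mixing_rate(W):
    """Spectral norm of W minus the exact-averaging matrix (assumed definition of rho)."""
    n = W.shape[0]
    return np.linalg.norm(W - np.ones((n, n)) / n, 2)
```

Under these assumptions the complete graph averages exactly in one round, so its rate is zero up to rounding, while the ring with $n=8$ mixes more slowly. A method whose stepsize must be tuned to such a quantity needs it in advance; a parameter-free method does not.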

Theorems & Definitions (20)

  • Definition 3.1
  • Theorem 3.1
  • Remark 3.1
  • Theorem 3.2
  • Remark 3.2
  • Theorem 3.3
  • Remark 3.3
  • Lemma B.1
  • Lemma B.2
  • Lemma B.3: Lemma 3.3 in pmlr_v216_xiao23a
  • ...and 10 more