Decentralized Non-convex Stochastic Optimization with Heterogeneous Variance

Hongxu Chen, Ke Wei, Luo Luo

TL;DR

The paper tackles decentralized non-convex stochastic optimization when gradient noise is heterogeneous across nodes. It introduces D-NSS, which uses node-specific sampling to achieve a sample complexity that scales with the arithmetic mean of the local standard deviations, and proves a matching lower bound showing that this dependence is optimal. It further extends the approach to D-NSS-VR, which under mean-squared smoothness achieves an improved sample complexity while preserving the arithmetic-mean dependence. Numerical experiments on the a9a, w8a, and mnist datasets support the theory. Overall, the work clarifies how variance heterogeneity shapes the complexity of decentralized learning.

Abstract

Decentralized optimization is critical for solving large-scale machine learning problems over distributed networks, where multiple nodes collaborate through local communication. In practice, the variances of stochastic gradient estimators often differ across nodes, yet their impact on algorithm design and complexity remains unclear. To address this issue, we propose D-NSS, a decentralized algorithm with node-specific sampling, and establish its sample complexity depending on the arithmetic mean of local standard deviations, achieving tighter bounds than existing methods that rely on the worst-case or quadratic mean. We further derive a matching sample complexity lower bound under heterogeneous variance, thereby proving the optimality of this dependence. Moreover, we extend the framework with a variance reduction technique and develop D-NSS-VR, which under the mean-squared smoothness assumption attains an improved sample complexity bound while preserving the arithmetic-mean dependence. Finally, numerical experiments validate the theoretical results and demonstrate the effectiveness of the proposed algorithms.
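To make the difference between the three variance summaries concrete, the sketch below compares the arithmetic mean, the quadratic mean, and the worst case of a set of local standard deviations. It also illustrates, under simplifying assumptions that are not taken from the paper (independent noise across nodes, a plain average of local mini-batch gradients, real-valued batch sizes, strictly positive standard deviations), why node-specific batch sizes can turn a quadratic-mean dependence into an arithmetic-mean one. The function names and the proportional allocation rule are hypothetical illustrations, not the D-NSS algorithm itself.

```python
import math


def mean_statistics(sigmas):
    """Return (arithmetic mean, quadratic mean, worst case) of local std devs."""
    n = len(sigmas)
    arithmetic = sum(sigmas) / n
    quadratic = math.sqrt(sum(s * s for s in sigmas) / n)
    worst = max(sigmas)
    return arithmetic, quadratic, worst


def averaged_variance(sigmas, batch_sizes):
    """Variance of the network-wide average of local mini-batch gradients.

    Assumes node i averages batch_sizes[i] independent samples, each with
    variance sigmas[i] ** 2, and that noise is independent across nodes.
    """
    n = len(sigmas)
    return sum(s * s / b for s, b in zip(sigmas, batch_sizes)) / (n * n)


def proportional_batches(sigmas, total_budget):
    """Hypothetical allocation: split the budget in proportion to sigma_i.

    Batch sizes are kept real-valued for clarity; all sigmas must be positive.
    """
    total = sum(sigmas)
    return [total_budget * s / total for s in sigmas]


if __name__ == "__main__":
    sigmas = [0.1, 0.1, 0.1, 3.7]  # hypothetical: one very noisy node
    budget = 400                   # hypothetical total samples per round

    am, qm, worst = mean_statistics(sigmas)
    uniform = [budget / len(sigmas)] * len(sigmas)
    proportional = proportional_batches(sigmas, budget)

    print("arithmetic mean:", am)
    print("quadratic mean: ", qm)
    print("worst case:     ", worst)
    print("variance, uniform batches:     ", averaged_variance(sigmas, uniform))
    print("variance, proportional batches:", averaged_variance(sigmas, proportional))
```

Under these assumptions, uniform batches give an averaged variance of (quadratic mean)^2 / budget, while batches proportional to the local standard deviations give (arithmetic mean)^2 / budget, which is never larger by the Cauchy-Schwarz inequality.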

Paper Structure

This paper contains 18 sections, 15 theorems, 124 equations, 2 figures, 2 tables, and 3 algorithms.

Key Result

Theorem 2.6

Under Assumptions 2.1-2.4, consider the D-NSS algorithm with the parameter choices specified in the theorem statement. Then the output of the algorithm is an $\epsilon$-stationary point satisfying $\mathbb{E} \left[ \|\nabla f(x_{i,\mathrm{out}})\|^2 \right] \leq \epsilon^2$. The theorem also gives explicit upper bounds on the sample complexity and on the communication complexity.
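The stationarity criterion above bounds the expected squared gradient norm, not the norm of any single output. As a rough illustration, the sketch below estimates that expectation by averaging over gradients collected from several independent runs and compares it with $\epsilon^2$. The function names and the idea of evaluating exact gradients at the outputs are assumptions made for this example; the paper's experimental protocol is not described in this summary.

```python
def mean_squared_grad_norm(gradients):
    """Average of squared Euclidean norms over a list of gradient vectors."""
    return sum(sum(g * g for g in grad) for grad in gradients) / len(gradients)


def is_epsilon_stationary(gradients, epsilon):
    """Empirical check of E[||grad f(x_out)||^2] <= epsilon^2.

    `gradients` holds one exact gradient per independent run (hypothetical
    setup); the sample average stands in for the expectation.
    """
    return mean_squared_grad_norm(gradients) <= epsilon ** 2


if __name__ == "__main__":
    # Hypothetical gradients at the outputs of two independent runs.
    grads = [[0.1, 0.0], [0.0, 0.1]]
    print(is_epsilon_stationary(grads, epsilon=0.2))
    print(is_epsilon_stationary(grads, epsilon=0.05))
```

Note that the threshold is $\epsilon^2$ because the criterion is stated on the squared norm.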

Figures (2)

  • Figure 1: Performance comparison of decentralized algorithms in terms of the number of samples on datasets a9a, w8a, and mnist. The lines represent averages over 5 runs, and the shaded regions denote the standard deviations.
  • Figure 2: Performance comparison of decentralized variance reduction algorithms in terms of the number of samples on datasets a9a, w8a, and mnist. The lines represent averages over 5 runs, and the shaded regions denote the standard deviations.

Theorems & Definitions (27)

  • Remark 2.4
  • Theorem 2.6
  • Definition 3.1: Decentralized first-order algorithm class
  • Theorem 3.2
  • Theorem 4.2
  • Theorem 4.3
  • Remark 4.4
  • Lemma B.1
  • Lemma B.2: ye2023multi
  • ...and 17 more