Decentralized Multi-Level Compositional Optimization Algorithms with Level-Independent Convergence Rate

Hongchang Gao

TL;DR

This work tackles decentralized stochastic multi-level compositional optimization, where nested functions across distributed devices define the objective $F(x)=\frac{1}{N}\sum_{n=1}^{N}F_n(x)$. It introduces two algorithms, DSMCGDM and DSMCVRG, that achieve level-independent convergence in nonconvex settings by combining momentum, gradient tracking, and STORM-like variance reduction for inner levels (with a practical alternative for outer gradients in the second method). Theoretical results show rate guarantees: $O((1-\lambda)^{-2}\epsilon^{-4})$ for the momentum-based method and $O((1-\lambda)^{-2}\epsilon^{-3})$ for the variance-reduced variant, with sample and communication costs scaling as $O((1-\lambda)^{-2}\epsilon^{-4})$ under unit mini-batch sizes. Empirical results on multi-step model-agnostic meta-learning tasks corroborate the advantages of the proposed decentralized approaches over standard DSGD, including faster convergence and better scalability across graph topologies and additional levels.
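The STORM-like variance reduction mentioned above can be illustrated with a minimal sketch. It is not the paper's DSMCGDM or DSMCVRG update: the toy inner map $g(x;\xi)=2x+\xi$, the slowly drifting iterates, and the names `storm_track_inner_value` and `demo` are all assumptions made for illustration. The sketch only shows the general recursion $u_t = g(x_t;\xi_t) + (1-\beta)\,(u_{t-1} - g(x_{t-1};\xi_t))$, which evaluates the same sample at two consecutive iterates so that most of the noise cancels.

```python
import numpy as np


def storm_track_inner_value(xs, sample_g, beta, rng):
    """Track E[g(x_t; xi)] along the iterates xs with a STORM-like recursion.

    xs        : array of shape (steps, dim), the sequence of iterates
    sample_g  : callable (x, xi) -> noisy evaluation of the inner function
    beta      : mixing weight in (0, 1]; beta = 1 recovers plain sampling
    rng       : numpy random Generator used to draw the samples xi
    """
    u = sample_g(xs[0], rng.standard_normal(xs[0].shape))
    estimates = [u]
    for t in range(1, len(xs)):
        # One fresh sample, evaluated at both the new and the old iterate.
        xi = rng.standard_normal(xs[t].shape)
        u = sample_g(xs[t], xi) + (1.0 - beta) * (u - sample_g(xs[t - 1], xi))
        estimates.append(u)
    return np.array(estimates)


def demo(seed=0, steps=2000, dim=5, beta=0.1):
    """Compare the tracking error with that of a single raw sample (toy setup)."""
    rng = np.random.default_rng(seed)
    # Hypothetical slowly moving iterates, standing in for optimizer steps.
    xs = np.cumsum(0.01 * rng.standard_normal((steps, dim)), axis=0)
    # Hypothetical inner map: exact value 2x plus additive unit-variance noise.
    sample_g = lambda x, xi: 2.0 * x + xi
    est = storm_track_inner_value(xs, sample_g, beta, rng)
    truth = 2.0 * xs
    raw = truth + rng.standard_normal(truth.shape)
    half = steps // 2  # discard the warm-up phase
    storm_mse = float(np.mean((est[half:] - truth[half:]) ** 2))
    raw_mse = float(np.mean((raw[half:] - truth[half:]) ** 2))
    return storm_mse, raw_mse


if __name__ == "__main__":
    storm_mse, raw_mse = demo()
    print(f"STORM-like tracking MSE: {storm_mse:.4f}, raw sample MSE: {raw_mse:.4f}")
```

In this toy setup the tracking error obeys $e_t = (1-\beta)e_{t-1} + \beta\xi_t$, so a small $\beta$ averages the noise over many steps while still following the moving iterate.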

Abstract

Stochastic multi-level compositional optimization problems cover many new machine learning paradigms, e.g., multi-step model-agnostic meta-learning, which require efficient optimization algorithms for large-scale data. This paper studies the decentralized stochastic multi-level optimization algorithm, which is challenging because the multi-level structure and decentralized communication scheme may make the number of levels significantly affect the order of the convergence rate. To this end, we develop two novel decentralized optimization algorithms to optimize the multi-level compositional optimization problem. Our theoretical results show that both algorithms can achieve the level-independent convergence rate for nonconvex problems under much milder conditions compared with existing single-machine algorithms. To the best of our knowledge, this is the first work that achieves the level-independent convergence rate under the decentralized setting. Moreover, extensive experiments confirm the efficacy of our proposed algorithms.
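For orientation, the problem class described in the abstract can be written out as below. This is a sketch in assumed notation: only $F(x)=\frac{1}{N}\sum_{n=1}^{N}F_n(x)$ and the use of $K$ levels come from this page, while the symbols $f_n^{(k)}$, $\xi^{(k)}$, and $y_k$ are introduced here for illustration and may differ from the paper's own.

```latex
\min_{x \in \mathbb{R}^d} \; F(x) = \frac{1}{N}\sum_{n=1}^{N} F_n(x),
\qquad
F_n(x) = f_n^{(K)}\Bigl(f_n^{(K-1)}\bigl(\cdots f_n^{(1)}(x)\cdots\bigr)\Bigr),
\qquad
f_n^{(k)}(\cdot) = \mathbb{E}_{\xi^{(k)}}\bigl[f_n^{(k)}(\cdot\,;\xi^{(k)})\bigr].

% Chain rule with y_0 = x and y_k = f_n^{(k)}(y_{k-1});
% \nabla f_n^{(k)} denotes the transposed Jacobian of level k.
\nabla F_n(x) = \nabla f_n^{(1)}(y_0)\,\nabla f_n^{(2)}(y_1)\cdots\nabla f_n^{(K)}(y_{K-1}).
```

Each factor in the chain rule is evaluated at the output of the previous level, which is itself only available through noisy samples. Estimation errors can therefore compound across levels, which is one way to read the abstract's remark that the number of levels may affect the order of the convergence rate.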

Paper Structure

This paper contains 20 sections, 28 theorems, 117 equations, 4 figures, 2 algorithms.

Key Result

Theorem 1

Given Assumptions assumption_graph-assumption_bound_variance, by setting $\mu>0$, $\beta>0$, $\alpha\leq \min \{(1-\lambda)^2/\sqrt{\tilde{\alpha}_1},\ 1/(4\sqrt{\tilde{\alpha}_2})\}$, $\eta \leq \min\{\tilde{\omega}_k/(8\beta\sum_{j=1}^{K-1} \tilde{\omega}_jC_j^2 \prod_{i=k+1}^{j}(2C_{i}^2)),\ \ldots\}$, where $\tilde{\omega}_k= \frac{2}{\beta} ((12A_k+8 D_{k})\mu + 2 A_k + 2 \beta \sum_{j=k+1}^{K\cdots} \cdots$ …
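The step-size bound in Theorem 1 and the rates in the TL;DR depend on the factor $(1-\lambda)$. The sketch below shows how such a quantity could be computed for a ring graph, under two assumptions that this page does not confirm: that $\lambda$ is the second-largest eigenvalue magnitude of a symmetric doubly stochastic mixing matrix (a common convention in decentralized optimization), and that each node averages itself and its two neighbors with uniform weights of 1/3. The function names are hypothetical.

```python
import numpy as np


def ring_mixing_matrix(n: int) -> np.ndarray:
    """Symmetric doubly stochastic mixing matrix for a ring of n >= 3 nodes.

    Assumed weighting: every node gives weight 1/3 to itself and to each
    of its two ring neighbors.
    """
    if n < 3:
        raise ValueError("a ring needs at least 3 nodes")
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W


def second_largest_eigenvalue_magnitude(W: np.ndarray) -> float:
    """Return the second-largest absolute eigenvalue of a symmetric matrix W."""
    eigvals = np.linalg.eigvalsh(W)  # eigvalsh is valid because W is symmetric
    magnitudes = np.sort(np.abs(eigvals))[::-1]
    return float(magnitudes[1])


if __name__ == "__main__":
    for n in (8, 16, 32):
        lam = second_largest_eigenvalue_magnitude(ring_mixing_matrix(n))
        # (1 - lam)^(-2) is the topology-dependent factor in the stated rates.
        print(f"n={n:3d}  lambda={lam:.4f}  (1-lambda)^-2={(1 - lam) ** -2:.1f}")
```

Under these assumptions the spectral gap $1-\lambda$ shrinks as the ring grows, so the $(1-\lambda)^{-2}$ factor increases with the number of nodes on sparse topologies.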

Figures (4)

  • Figure 1: Regression: The loss function value on support and query sets versus the number of iterations for the ring and random graph.
  • Figure 2: Classification: The loss function value on support set and test accuracy on query set versus the number of epochs for the ring and random graph.
  • Figure 3: The loss function value on support set versus the number of iterations for the regression task. The ring graph is used.
  • Figure 4: The loss function value on support set versus the consumed time for the regression task and ring graph.

Theorems & Definitions (49)

  • Theorem 1
  • Corollary 1
  • Remark 1
  • Remark 2
  • Remark 3
  • Theorem 2
  • Corollary 2
  • Remark 4
  • Remark 5
  • Remark 6
  • ...and 39 more