Decentralized Multi-Level Compositional Optimization Algorithms with Level-Independent Convergence Rate

Hongchang Gao

TL;DR

This work tackles decentralized stochastic multi-level compositional optimization, where nested functions across distributed devices define the objective $F(x)=\frac{1}{N}\sum_{n=1}^{N}F_n(x)$. It introduces two algorithms, DSMCGDM and DSMCVRG, that achieve level-independent convergence in nonconvex settings by combining momentum, gradient tracking, and STORM-like variance reduction for inner levels (with a practical alternative for outer gradients in the second method). Theoretical results show rate guarantees: $O((1-\lambda)^{-2}\epsilon^{-4})$ for the momentum-based method and $O((1-\lambda)^{-2}\epsilon^{-3})$ for the variance-reduced variant, with sample and communication costs scaling as $O((1-\lambda)^{-2}\epsilon^{-4})$ under unit mini-batch sizes. Empirical results on multi-step model-agnostic meta-learning tasks corroborate the advantages of the proposed decentralized approaches over standard DSGD, including faster convergence and better scalability across graph topologies and additional levels.
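
To make the phrase "STORM-like variance reduction for inner levels" more concrete, here is a minimal sketch of a STORM-style recursive estimator that tracks the value of a single noisy inner function along slowly moving iterates. It is an illustration under stated assumptions, not the paper's DSMCVRG update: the toy function `noisy_g`, the additive-noise model, the iterate sequence, and the value of `beta` are all hypothetical.

```python
import numpy as np

def noisy_g(x, noise):
    """Hypothetical inner function g(x) = 2x + 1, observed with additive noise."""
    return 2.0 * x + 1.0 + noise

def storm_track(xs, beta, rng, sigma=0.5):
    """Track g(x_t) with a STORM-style recursion:
        u_t = g(x_t; xi_t) + (1 - beta) * (u_{t-1} - g(x_{t-1}; xi_t)),
    where the same sample xi_t is evaluated at both x_t and x_{t-1}."""
    u = noisy_g(xs[0], sigma * rng.standard_normal(xs[0].shape))
    estimates = [u]
    for t in range(1, len(xs)):
        noise = sigma * rng.standard_normal(xs[t].shape)  # shared sample xi_t
        u = noisy_g(xs[t], noise) + (1.0 - beta) * (u - noisy_g(xs[t - 1], noise))
        estimates.append(u)
    return estimates

rng = np.random.default_rng(0)
# Slowly moving iterates, as a small-step optimizer would produce (assumed).
xs = [np.full(3, 0.01 * t) for t in range(500)]
est = storm_track(xs, beta=0.1, rng=rng)

# Squared tracking error against the noiseless value g(x_t) = 2 x_t + 1.
errors = [float(np.sum((u - (2.0 * x + 1.0)) ** 2)) for u, x in zip(est, xs)]
mean_sq_err = float(np.mean(errors[-200:]))
raw_sq_err = 3 * 0.5 ** 2  # expected squared error of a single noisy sample
```

In this toy setting the correction term cancels the drift of the iterates, so the tracking error is driven only by the damped noise and `mean_sq_err` ends up well below `raw_sq_err`.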

Abstract

Stochastic multi-level compositional optimization problems cover many new machine learning paradigms, e.g., multi-step model-agnostic meta-learning, which require efficient optimization algorithms for large-scale data. This paper studies the decentralized stochastic multi-level optimization algorithm, which is challenging because the multi-level structure and decentralized communication scheme may make the number of levels significantly affect the order of the convergence rate. To this end, we develop two novel decentralized optimization algorithms to optimize the multi-level compositional optimization problem. Our theoretical results show that both algorithms can achieve the level-independent convergence rate for nonconvex problems under much milder conditions compared with existing single-machine algorithms. To the best of our knowledge, this is the first work that achieves the level-independent convergence rate under the decentralized setting. Moreover, extensive experiments confirm the efficacy of our proposed algorithms.
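
For orientation, the nested structure can be written out explicitly. The form below is the standard $K$-level compositional objective and is an assumption about notation: the symbols $f_n^{(k)}$, $\xi^{(k)}$, and $y_k$ do not appear in this summary, and only $F(x)=\frac{1}{N}\sum_{n=1}^{N}F_n(x)$ is taken from it.

$$F_n(x) = f_n^{(K)}\Big(f_n^{(K-1)}\big(\cdots f_n^{(1)}(x)\cdots\big)\Big), \qquad f_n^{(k)}(y) = \mathbb{E}_{\xi^{(k)}}\big[f_n^{(k)}(y;\xi^{(k)})\big].$$

Writing $y_0 = x$ and $y_k = f_n^{(k)}(y_{k-1})$, the chain rule gives

$$\nabla F_n(x) = \nabla f_n^{(1)}(y_0)\,\nabla f_n^{(2)}(y_1)\cdots\nabla f_n^{(K)}(y_{K-1}),$$

where $\nabla f_n^{(k)}$ denotes the transposed Jacobian of level $k$. Each factor is evaluated at an inner value $y_{k-1}$ that must itself be estimated from samples, so estimation errors propagate through all $K$ levels. This is one way to see why the number of levels can enter the convergence rate.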

Paper Structure (20 sections, 28 theorems, 117 equations, 4 figures, 2 algorithms)

Introduction
Related Work
Stochastic Two-Level Compositional Optimization
Stochastic Multi-Level Compositional Optimization
Decentralized Compositional Optimization
Decentralized Stochastic Multi-Level Compositional Optimization
Decentralized Stochastic Multi-level Compositional Gradient Descent with Momentum
Challenges.
Decentralized Stochastic Multi-Level Compositional Variance-Reduced Gradient Descent
Novelty.
Convergence Analysis
Experiment
Multi-Step Model-Agnostic Meta-Learning
Experimental Settings and Results
More Experiments
...and 5 more sections

Key Result

Theorem 1

Given Assumptions assumption_graph through assumption_bound_variance, by setting $\mu>0$, $\beta>0$, $\alpha\leq \min \{(1-\lambda)^2/\sqrt{\tilde{\alpha}_1}, 1/(4\sqrt{\tilde{\alpha}_2})\}$, $\eta \leq \min\{\tilde{\omega}_k/(8\beta\sum_{j=1}^{K-1} \tilde{\omega}_jC_j^2 \prod_{i=k+1}^{j}(2C_{i}^2)), \ldots\}$, where $\tilde{\omega}_k= \frac{2}{\beta} ((12A_k+8 D_{k})\mu + 2 A_k + 2 \beta \sum_{j=k+1}^{K\ldots} \cdots$ …

Figures (4)

Figure 1: Regression: The loss function value on support and query sets versus the number of iterations for the ring and random graph.
Figure 2: Classification: The loss function value on support set and test accuracy on query set versus the number of epochs for the ring and random graph.
Figure 3: The loss function value on the support set versus the number of iterations for the regression task on the ring graph.
Figure 4: The loss function value on support set versus the consumed time for the regression task and ring graph.
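
The figures compare ring and random graphs, and the stated rates carry a factor $(1-\lambda)^{-2}$. The sketch below assumes that $\lambda$ is the second-largest eigenvalue magnitude of a doubly stochastic mixing matrix, as is common in decentralized optimization; this summary does not define it. The equal-weight ring matrix, the helper names, and the network size are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes.
    Each node averages itself and its two neighbors with weight 1/3 (assumed)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def second_largest_eigenvalue_magnitude(W):
    """Return lambda, the second-largest |eigenvalue| of a symmetric W."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(W)))
    return float(mags[-2])

n = 8
W = ring_mixing_matrix(n)
lam = second_largest_eigenvalue_magnitude(W)
gap_factor = (1.0 - lam) ** -2  # the topology-dependent factor in the rates

# Gossip averaging: repeated mixing drives every node to the network mean.
x = np.arange(n, dtype=float)
for _ in range(200):
    x = W @ x
```

A larger ring pushes `lam` toward 1 and inflates `gap_factor`, which is consistent with sparse topologies being the harder case for decentralized methods.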

Theorems & Definitions (49)

Theorem 1
Corollary 1
Remark 1
Remark 2
Remark 3
Theorem 2
Corollary 2
Remark 4
Remark 5
Remark 6
...and 39 more