On the Communication Complexity of Decentralized Bilevel Optimization

Yihan Zhang; My T. Thai; Jie Wu; Hongchang Gao

TL;DR

The paper tackles decentralized stochastic bilevel optimization with heterogeneous data across workers by proposing two variance-reduced algorithms, DSVRBGD-S (simultaneous update) and DSVRBGD-A (alternating update), that achieve low per-round communication cost and require few communication rounds. It develops biased hypergradient estimators and a gradient-tracking framework to enable decentralized updates, and provides convergence guarantees that avoid strong assumptions on data heterogeneity and depend explicitly on the network spectral gap $1-\lambda$. The key theoretical contributions include a bound of $T=O\left(\frac{1}{K(1-\lambda)^4\epsilon^{3/2}}\right)$ communication rounds to reach $\epsilon$-accuracy, and the first convergence-rate results for the alternating update strategy combined with a variance-reduced gradient in stochastic bilevel optimization. Empirically, the proposed methods show better communication efficiency and accuracy on heterogeneous distributed hyperparameter optimization tasks across multiple graph topologies.
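To make the roles of gradient tracking and the spectral gap $1-\lambda$ concrete, the following is a minimal sketch of generic single-level decentralized gradient tracking on a ring graph. It is not DSVRBGD-S or DSVRBGD-A: there is no lower-level problem, no hypergradient estimator, and no variance reduction. The function names, the equal-weight mixing matrix, the quadratic local objectives, and the step size are all assumptions chosen for illustration.

```python
import numpy as np

def ring_mixing_matrix(K):
    """Symmetric doubly stochastic matrix for a ring (assumed equal weights 1/3)."""
    W = np.zeros((K, K))
    for k in range(K):
        for j in (k - 1, k, k + 1):      # left neighbor, self, right neighbor
            W[k, j % K] += 1.0 / 3.0
    return W

def spectral_gap(W):
    """Return 1 - lambda, with lambda the second-largest eigenvalue magnitude of W."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - mags[1]

def gradient_tracking(W, targets, eta=0.1, rounds=300):
    """Worker k minimizes f_k(x) = 0.5 * ||x - targets[k]||^2 (heterogeneous data).

    Each round costs two neighbor exchanges: one for the iterates X and one
    for the gradient trackers Y.
    """
    grad = lambda X: X - targets         # row k is the local gradient at worker k
    X = np.zeros_like(targets)
    G = grad(X)
    Y = G.copy()                         # tracker starts at the local gradient
    for _ in range(rounds):
        X_new = W @ (X - eta * Y)        # local step, then average with neighbors
        G_new = grad(X_new)
        Y = W @ Y + G_new - G            # tracker follows the network-average gradient
        X, G = X_new, G_new
    return X

K = 8
W = ring_mixing_matrix(K)
targets = np.arange(K, dtype=float).reshape(-1, 1)
X = gradient_tracking(W, targets)
print("spectral gap:", spectral_gap(W))
print("max deviation from global minimizer:", np.abs(X - targets.mean()).max())
```

Although every worker sees a different local objective, all iterates approach the minimizer of the average objective. A sparser graph has a smaller spectral gap and mixes more slowly, which is the qualitative effect captured by the $(1-\lambda)$ dependence in the bound above.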

Abstract

Stochastic bilevel optimization finds widespread applications in machine learning, including meta-learning, hyperparameter optimization, and neural architecture search. To extend stochastic bilevel optimization to distributed data, several decentralized stochastic bilevel optimization algorithms have been developed. However, existing methods often suffer from slow convergence rates and high communication costs in heterogeneous settings, limiting their applicability to real-world tasks. To address these issues, we propose two novel decentralized stochastic bilevel gradient descent algorithms based on simultaneous and alternating update strategies. Our algorithms can achieve faster convergence rates and lower communication costs than existing methods. Importantly, our convergence analyses do not rely on strong assumptions regarding heterogeneity. More importantly, our theoretical analysis clearly discloses how the additional communication required for estimating hypergradient under the heterogeneous setting affects the convergence rate. To the best of our knowledge, this is the first time such favorable theoretical results have been achieved with mild assumptions in the heterogeneous setting. Furthermore, we demonstrate how to establish the convergence rate for the alternating update strategy when combined with the variance-reduced gradient. Finally, experimental results confirm the efficacy of our algorithms.
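For orientation, the problem class and the hypergradient can be written out as follows. This is the standard formulation under the usual assumption that the lower-level objective is strongly convex in $y$; the notation ($K$ workers, local upper-level functions $f^{(k)}$, local lower-level functions $g^{(k)}$) is assumed here and may differ from the paper's.

```latex
\begin{align*}
\min_{x}\; F(x) &= \frac{1}{K}\sum_{k=1}^{K} f^{(k)}\bigl(x, y^*(x)\bigr),
\qquad
y^*(x) = \arg\min_{y}\; \frac{1}{K}\sum_{k=1}^{K} g^{(k)}(x, y), \\[4pt]
\nabla F(x) &= \nabla_x f\bigl(x, y^*(x)\bigr)
 - \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)
   \Bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\Bigr]^{-1}
   \nabla_y f\bigl(x, y^*(x)\bigr), \\[4pt]
\text{where}\quad
f &= \frac{1}{K}\sum_{k=1}^{K} f^{(k)},
\qquad
g = \frac{1}{K}\sum_{k=1}^{K} g^{(k)}.
\end{align*}
```

Because the inverse acts on the averaged Hessian, and in general $\bigl[\frac{1}{K}\sum_k H_k\bigr]^{-1} \neq \frac{1}{K}\sum_k H_k^{-1}$, a worker cannot form the global hypergradient from its local data alone when the $g^{(k)}$ differ. This is the source of the additional communication for hypergradient estimation that the abstract refers to.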

Paper Structure (11 sections, 4 theorems, 12 equations, 6 figures, 1 table)

This paper contains 11 sections, 4 theorems, 12 equations, 6 figures, 1 table.

Introduction
Related Work
Stochastic Bilevel Optimization
Decentralized Stochastic Bilevel Optimization
Preliminaries
Decentralized Stochastic Bilevel Gradient Descent
Estimation of Stochastic Hypergradient
Decentralized Stochastic Variance-Reduced Bilevel Gradient Descent Algorithm
Convergence Analysis
Experiment
Conclusion

Key Result

Theorem 1

Under the assumptions labeled assumption_bi_strong through assumption_graph in the paper, by letting $\eta$, $\beta_1$, $\beta_2$, and $\beta_3$ satisfy Eq. (110), and setting $\alpha_1=O(\frac{1}{K})$, $\alpha_2=O(\frac{1}{K})$, and $\alpha_3=O(\frac{1}{K})$, DSVRBGD-S has the following convergence rate: [displayed bound missing from the source], where $B_0$ is the batch size in the first iteration.

Figures (6)

Figure 1: The training loss function value versus the communication cost (MB) with a ring graph.
Figure 2: The training loss function value versus the communication cost (MB) with a random graph.
Figure 3: The training loss function value versus the communication cost (MB) with a torus graph.
Figure 4: The test accuracy versus the communication cost (MB) with a ring graph.
Figure 5: The test accuracy versus the communication cost (MB) with a random graph.
...and 1 more figure

Theorems & Definitions (8)

Remark 1
Remark 2
Theorem 1
Corollary 1
Remark 3
Theorem 2
Corollary 2
Remark 4
