Distributed Stochastic Bilevel Optimization: Improved Complexity and Heterogeneity Analysis

Youcheng Niu; Jinming Xu; Ying Sun; Yan Huang; Li Chai

TL;DR

This work tackles distributed stochastic bilevel optimization with personalized inner problems, formulating $\Phi(x)=\frac{1}{m}\sum_i f_i(x,\theta_i^*(x))$ where $\theta_i^*(x)=\arg\min_\theta g_i(x,\theta)$. It introduces LoPA, a loopless, communication-efficient algorithm with two variants: LoPA-LG (local gradient) and LoPA-GT (gradient tracking), including a gradient momentum mechanism to mitigate hypergradient bias. The authors provide a comprehensive heterogeneity-aware convergence analysis, deriving rate bounds that explicitly depend on condition number $\kappa$, network gap $\rho$, heterogeneity $b$, and gradient variances $\sigma_p, \sigma_c$; gradient tracking further reduces heterogeneity effects and improves rates. They prove that LoPA attains $\mathcal{O}(\epsilon^{-2})$ complexity in terms of Hessian evaluations, surpassing prior DSBO methods, and validate the theory with numerical experiments on distributed classification and hyperparameter optimization.
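To make the formulation and the word "loopless" concrete, here is a minimal sketch on a scalar toy problem. It is not LoPA itself: it is deterministic, averages exactly instead of communicating over a network, and has no momentum or gradient tracking. The quadratic choices $g_i(x,\theta)=\frac{1}{2}(\theta-a_i x)^2$ and $f_i(x,\theta)=\frac{1}{2}(\theta-c_i)^2$, the step sizes, and the function names are all assumptions made for illustration. Each iteration takes one step on every $\theta_i$, one step on an auxiliary variable $v_i$ that approximates the Hessian-inverse-vector product, and one step on $x$, with no inner loop.

```python
def loopless_bilevel(a, c, alpha=0.05, beta=0.5, lam=0.5, iters=500):
    """Single-loop updates on a toy quadratic bilevel problem (illustrative, not LoPA)."""
    m = len(a)
    x = 0.0
    theta = [0.0] * m  # per-node inner variables, approximating theta_i^*(x) = a_i * x
    v = [0.0] * m      # per-node estimates of (Hessian of g_i in theta)^{-1} * grad_theta f_i
    for _ in range(iters):
        for i in range(m):
            # One gradient step on g_i(x, theta) = 0.5 * (theta - a_i * x)^2.
            theta[i] -= beta * (theta[i] - a[i] * x)
            # One step on the linear system H v = grad_theta f_i,
            # where H = 1 and grad_theta f_i = theta_i - c_i for this toy problem.
            v[i] -= lam * (1.0 * v[i] - (theta[i] - c[i]))
        # Hypergradient estimate: grad_x f_i - (d^2 g_i / dx dtheta) * v_i,
        # with grad_x f_i = 0 and d^2 g_i / dx dtheta = -a_i, averaged exactly over nodes.
        h = sum(0.0 - (-a[i]) * v[i] for i in range(m)) / m
        x -= alpha * h
    return x


def closed_form_minimizer(a, c):
    """Minimizer of Phi(x) = (1/m) * sum_i 0.5 * (a_i * x - c_i)^2."""
    return sum(ai * ci for ai, ci in zip(a, c)) / sum(ai * ai for ai in a)


if __name__ == "__main__":
    a = [1.0, 2.0, 0.5]
    c = [1.0, 3.0, -1.0]
    print(loopless_bilevel(a, c), closed_form_minimizer(a, c))
```

On this toy instance the single-loop iterate approaches the closed-form minimizer of $\Phi$, which is the behavior a loopless scheme relies on: the inner variable and the auxiliary variable only need to track their targets while $x$ moves slowly.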

Abstract

This paper considers a class of nonconvex-strongly-convex distributed stochastic bilevel optimization (DSBO) problems with personalized inner-level objectives. Most existing algorithms require computational loops for hypergradient estimation, leading to computational inefficiency. Moreover, the impact of data heterogeneity on convergence in bilevel problems has not yet been explicitly characterized. To address these issues, we propose LoPA, a loopless personalized distributed algorithm that leverages a tracking mechanism for iterative approximation of inner-level solutions and Hessian-inverse matrices without relying on extra computation loops. Our theoretical analysis explicitly characterizes the heterogeneity across nodes (denoted by $b$), and establishes a sublinear rate of $\mathcal{O}\big( \frac{1}{(1-\rho)K} + \frac{(\frac{b}{\sqrt{m}})^{\frac{2}{3}}}{(1-\rho)^{\frac{2}{3}} K^{\frac{2}{3}}} + \frac{1}{\sqrt{K}}( \sigma_{\rm p} + \frac{1}{\sqrt{m}}\sigma_{\rm c} ) \big)$ without the boundedness of local hypergradients, where $\sigma_{\rm p}$ and $\sigma_{\rm c}$ represent the gradient sampling variances associated with the inner- and outer-level variables, respectively. We also integrate LoPA with a gradient tracking scheme to eliminate the impact of data heterogeneity, yielding an improved rate of $\mathcal{O}\big( \frac{1}{(1-\rho)^2 K} + \frac{1}{\sqrt{K}}( \sigma_{\rm p} + \frac{1}{\sqrt{m}}\sigma_{\rm c} ) \big)$. The computational complexity of LoPA is $\mathcal{O}(\epsilon^{-2})$ to reach an $\epsilon$-stationary point, matching the communication complexity due to the loopless structure, which outperforms existing counterparts for DSBO. Numerical experiments validate the effectiveness of the proposed algorithm.
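The two rates quoted above can be compared numerically with a small helper. This is only a sketch: the constants hidden by $\mathcal{O}(\cdot)$ are not given in the abstract, so they are all set to 1 here, and the function names `lopa_lg_bound` and `lopa_gt_bound` are hypothetical.

```python
import math


def lopa_lg_bound(K, rho, b, m, sigma_p, sigma_c):
    """Rate expression quoted for LoPA with local gradients (hidden constants assumed to be 1)."""
    network = 1.0 / ((1.0 - rho) * K)
    heterogeneity = (b / math.sqrt(m)) ** (2.0 / 3.0) / (
        (1.0 - rho) ** (2.0 / 3.0) * K ** (2.0 / 3.0)
    )
    variance = (sigma_p + sigma_c / math.sqrt(m)) / math.sqrt(K)
    return network + heterogeneity + variance


def lopa_gt_bound(K, rho, m, sigma_p, sigma_c):
    """Rate expression quoted for LoPA with gradient tracking (hidden constants assumed to be 1)."""
    network = 1.0 / ((1.0 - rho) ** 2 * K)
    variance = (sigma_p + sigma_c / math.sqrt(m)) / math.sqrt(K)
    return network + variance


if __name__ == "__main__":
    for K in (10**2, 10**4, 10**6):
        print(K, lopa_lg_bound(K, 0.5, 2.0, 8, 1.0, 1.0), lopa_gt_bound(K, 0.5, 8, 1.0, 1.0))
```

With nonzero variances the $1/\sqrt{K}$ term dominates for large $K$, so driving either expression below $\epsilon$ needs $K$ on the order of $\epsilon^{-2}$, in line with the stated complexity. The heterogeneity term with $b$ appears only in the local-gradient expression.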

Paper Structure (32 sections, 17 theorems, 167 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Related Works
Algorithm Design
Preliminaries
The Proposed LoPA Algorithm
Convergence Results
Preliminaries
Convergence of LoPA-LG and LoPA-GT
Proof Sketch and Supporting Lemmas for Theorems TH-1 and TH-2
Further Discussions on Convergence Analysis
Numerical Experiments
Distributed Classification
Hyperparameter Optimization
Conclusion
Technical Preliminaries
...and 17 more sections

Key Result

Proposition 3

Suppose Assumptions ASS-OUTLEVEL and ASS-INNERLEVEL hold. Let $\bar{\nabla} f_i(x,\theta) \triangleq \nabla_x f_i(x,\theta) - \nabla_{x\theta}^2 g_i(x,\theta) v_i(x,\theta)$ be a surrogate of the local hypergradient $\nabla f_i(x,\theta_i^*(x))$ and denote $v_i(x,\theta) \triangleq \cdots$
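The surrogate in Proposition 3 can be sanity-checked on a scalar toy problem. The sketch below assumes the standard implicit-differentiation choice $v_i(x,\theta) = [\nabla_{\theta\theta}^2 g_i(x,\theta)]^{-1}\nabla_\theta f_i(x,\theta)$, which the excerpt above does not spell out. The functions $g(x,\theta)=\frac{\mu}{2}\theta^2 - a x\theta$ and $f(x,\theta)=\frac{1}{2}(\theta-c)^2+\frac{r}{2}x^2$, and all names in the code, are hypothetical. The check is that the surrogate evaluated at $\theta=\theta^*(x)$ matches a finite-difference derivative of $\Phi(x)=f(x,\theta^*(x))$.

```python
def surrogate_hypergrad(x, theta, a, c, mu, r):
    """Surrogate grad_x f - (d^2 g / dx dtheta) * v for the toy f and g described above."""
    grad_x_f = r * x            # d/dx of 0.5 * (theta - c)^2 + 0.5 * r * x^2
    grad_theta_f = theta - c    # d/dtheta of the same f
    hess_tt_g = mu              # d^2/dtheta^2 of 0.5 * mu * theta^2 - a * x * theta
    hess_xt_g = -a              # d^2/(dx dtheta) of the same g
    v = grad_theta_f / hess_tt_g  # assumed form of v: inverse Hessian times grad_theta f
    return grad_x_f - hess_xt_g * v


def phi(x, a, c, mu, r):
    """Outer objective evaluated at the exact inner solution theta^*(x) = a * x / mu."""
    theta_star = a * x / mu
    return 0.5 * (theta_star - c) ** 2 + 0.5 * r * x ** 2


def central_difference(fn, x, h=1e-5):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)


if __name__ == "__main__":
    a, c, mu, r, x = 2.0, 1.5, 4.0, 0.3, 0.7
    theta_star = a * x / mu
    print(surrogate_hypergrad(x, theta_star, a, c, mu, r))
    print(central_difference(lambda t: phi(t, a, c, mu, r), x))
```

Away from $\theta^*(x)$ the surrogate differs from the true hypergradient, which is why the accuracy of the tracked inner variable matters in the analysis.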

Figures (8)

Figure 1: Performance comparison of SPDB, MA-DSBO and our LoPA-LG and LoPA-GT algorithms over $4$ nodes for a 10-class classification task using the MNIST dataset.
Figure 2: Performance comparison of SPDB, MA-DSBO and our LoPA-LG and LoPA-GT algorithms over $8$ nodes for a 10-class classification task using the MNIST dataset.
Figure 3: Testing accuracy of LoPA-LG and LoPA-GT under different levels of data heterogeneity for a 10-class classification task using the MNIST dataset.
Figure 4: Synthetic label distributions with different levels of data heterogeneity across nodes. The label classes are represented with different colors.
Figure 5: Performance comparison of SPDB, MA-DSBO and our LoPA-LG and LoPA-GT algorithms w.r.t. the computational time for hyperparameter optimization on binary logistic regression problems under different datasets: i) MNIST (first column); ii) covtype (second column); iii) cifar10 (third column).
...and 3 more figures

Theorems & Definitions (22)

Remark 1: Weaker assumptions on data heterogeneity
Remark 2: Iterative approximation approach for Hv
Proposition 3: Smoothness property
Proposition 4: Boundedness property
Lemma 5: Bounded heterogeneity on overall hypergradients
Theorem 6
Corollary 7
Remark 8: Heterogeneity analysis
Theorem 9
Corollary 10
...and 12 more