SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

Shuchen Zhu, Boao Kong, Songtao Lu, Xinmeng Huang, Kun Yuan

TL;DR

SPARKLE introduces a unified, single-loop primal-dual framework for decentralized bilevel optimization. It addresses data heterogeneity by integrating several correction strategies (Exact Diffusion (ED), EXTRA, and gradient tracking (GT)) and allows different update schemes for the upper-, lower-, and auxiliary-level problems. The authors provide a unified convergence analysis with state-of-the-art rates, demonstrate linear speedup, and show that mixing heterogeneity-correction schemes across levels improves on using GT alone. In experiments on hyper-cleaning, distributed reinforcement learning, and decentralized meta-learning, SPARKLE performs robustly and often outperforms existing decentralized stochastic bilevel optimization (SBO) methods. The framework’s flexibility in network topology and level-specific updates offers practical benefits for large-scale distributed learning systems, though it is currently limited to strongly convex lower-level problems and is sensitive to problem conditioning.
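For orientation, the three levels mentioned above can be written out in the standard form used for decentralized bilevel problems. This is a sketch under assumed notation: the symbols $n$, $f_i$, $g_i$, $\Phi$, $y^\star$, and $z^\star$ are not defined in this summary and may differ from the paper's.

$$
\begin{aligned}
\text{upper level:}\quad & \min_{x}\ \Phi(x) = \frac{1}{n}\sum_{i=1}^{n} f_i\bigl(x, y^\star(x)\bigr),\\
\text{lower level:}\quad & y^\star(x) = \arg\min_{y}\ g(x,y),\qquad g(x,y)=\frac{1}{n}\sum_{i=1}^{n} g_i(x,y),\\
\text{auxiliary level:}\quad & z^\star(x) = \bigl[\nabla_{yy}^2 g\bigl(x,y^\star(x)\bigr)\bigr]^{-1}\,\nabla_y f\bigl(x,y^\star(x)\bigr),\qquad f=\frac{1}{n}\sum_{i=1}^{n} f_i,\\
\text{hypergradient:}\quad & \nabla\Phi(x) = \nabla_x f\bigl(x,y^\star(x)\bigr) - \nabla_{xy}^2 g\bigl(x,y^\star(x)\bigr)\,z^\star(x).
\end{aligned}
$$

When $g$ is strongly convex in $y$, the Hessian $\nabla_{yy}^2 g$ is invertible, so $z^\star(x)$ is well defined. This is consistent with the strongly convex lower-level limitation noted above.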

Abstract

This paper studies decentralized bilevel optimization, in which multiple agents collaborate to solve problems involving nested optimization structures with neighborhood communications. Most existing literature primarily utilizes gradient tracking to mitigate the influence of data heterogeneity, without exploring other well-known heterogeneity-correction techniques such as EXTRA or Exact Diffusion. Additionally, these studies often employ identical decentralized strategies for both upper- and lower-level problems, neglecting to leverage distinct mechanisms across different levels. To address these limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm. Moreover, SPARKLE allows for different strategies to solve upper- and lower-level problems. We present a unified convergence analysis for SPARKLE, applicable to all its variants, with state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Our results further reveal that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization, and using mixed strategies in bilevel algorithms brings more benefits than relying solely on gradient tracking.
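To make the role of heterogeneity correction concrete, here is a minimal sketch of gradient tracking on a toy single-level problem. It is not the SPARKLE algorithm: the ring of five agents, the scalar quadratic losses, the 1/3 mixing weights, the step size, and all function names are assumptions chosen for illustration. It compares plain decentralized gradient descent (no correction) with gradient tracking when the local objectives differ.

```python
# Illustrative only: a scalar single-level problem, not the SPARKLE bilevel updates.
N = 5                              # number of agents on a ring (assumed)
B = [0.0, 1.0, 2.0, 3.0, 4.0]      # heterogeneous local targets b_i (assumed)
ALPHA = 0.1                        # constant step size (assumed)
OPT = sum(B) / N                   # minimizer of (1/N) * sum_i 0.5 * (x - b_i)^2


def mix(v):
    """One communication round: average with the two ring neighbors (weights 1/3)."""
    return [(v[i - 1] + v[i] + v[(i + 1) % N]) / 3.0 for i in range(N)]


def grad(x):
    """Local gradients of f_i(x) = 0.5 * (x - b_i)^2 at each agent's own iterate."""
    return [x[i] - B[i] for i in range(N)]


def run_dgd(iters=400):
    """Decentralized gradient descent without correction: x <- W x - alpha * grad(x)."""
    x = [0.0] * N
    for _ in range(iters):
        g = grad(x)
        wx = mix(x)
        x = [wx[i] - ALPHA * g[i] for i in range(N)]
    return x


def run_gt(iters=400):
    """Gradient tracking: the extra variable y tracks the network-average gradient."""
    x = [0.0] * N
    g = grad(x)
    y = g[:]                       # initialize y^0 = grad(x^0)
    for _ in range(iters):
        wx = mix(x)
        x_new = [wx[i] - ALPHA * y[i] for i in range(N)]
        g_new = grad(x_new)
        wy = mix(y)
        y = [wy[i] + g_new[i] - g[i] for i in range(N)]
        x, g = x_new, g_new
    return x


def max_err(x):
    """Largest distance of any agent's iterate from the global minimizer."""
    return max(abs(xi - OPT) for xi in x)


dgd_err = max_err(run_dgd())
gt_err = max_err(run_gt())
print(f"DGD max error: {dgd_err:.3e}")
print(f"GT  max error: {gt_err:.3e}")
```

With a constant step size, the uncorrected iteration settles at a point where agents disagree because each is pulled toward its own `b_i`, while the tracked iteration reaches the common minimizer. EXTRA and Exact Diffusion remove the same bias with different recursions.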

Paper Structure

This paper contains 57 sections, 25 theorems, 238 equations, 9 figures, 6 tables, and 3 algorithms.

Key Result

Theorem 1

Under Assumptions smooth -- var, there exist proper constant step-sizes $\alpha,\,\beta,\,\gamma$ and a momentum coefficient $\theta$ such that the SPARKLE framework listed in Algorithm D-SOBA-SUDA converges as follows: where $\sigma\triangleq\max\{\sigma_{f,1},\sigma_{g,1},\sigma_{g,2}\}$, $\{\delta_{s,i}\}_{i=1}^3$ are constants depending only on $\mathbf{W}_s,\mathbf{A}_s,\mathbf{B}_s,\math

Figures (9)

  • Figure 1: SPARKLE algorithms, represented by ribbons, employ mixed decentralized mechanisms at the upper-level, lower-level, and auxiliary-level. Distinct colors denote the various decentralized mechanisms.
  • Figure 2: The test accuracy on hyper-cleaning with various SPARKLE-based algorithms using different corruption rates $p$. (Left: $p=0.1$, Middle: $p=0.2$, Right: $p=0.3$.)
  • Figure 3: Test accuracy of SPARKLE-EXTRA on hyper-cleaning. (Left: fixed graph for $x$ and varying graph for $y,z$; Right: fixed for $y,z$ and varying for $x$)
  • Figure 4: The upper-level loss against samples generated by one agent of different algorithms in the policy evaluation. (Left: $n=10$, Right: $n=20$.)
  • Figure 5: The estimation error of D-SOBA, SPARKLE-GT, SPARKLE-ED, and SPARKLE-EXTRA under different networks and data heterogeneity.
  • ...and 4 more figures

Theorems & Definitions (52)

  • Theorem 1
  • Remark 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Remark 2: SOTA transient iterations
  • Remark 3: GT is not the best technique for decentralized SBO
  • Corollary 4
  • Remark 4: Mixed strategies outperform employing GT only
  • Lemma 1
  • ...and 42 more