A Communication and Computation Efficient Fully First-order Method for Decentralized Bilevel Optimization

Min Wen, Chengchang Liu, Ahmed Abdelmoniem, Yipeng Zhou, Yuedong Xu

TL;DR

Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of $\text{C}^2$DFB across various topologies and heterogeneous data distributions.

Abstract

Bilevel optimization, crucial for hyperparameter tuning, meta-learning and reinforcement learning, remains less explored in the decentralized learning paradigm, such as decentralized federated learning (DFL). Typically, decentralized bilevel methods rely on both gradients and Hessian matrices to approximate hypergradients of upper-level models. However, acquiring and sharing the second-order oracle is computation- and communication-intensive. To overcome these challenges, this paper introduces $\text{C}^2$DFB, a fully first-order method for decentralized bilevel optimization that is both computation- and communication-efficient. In $\text{C}^2$DFB, each learning node optimizes a min-min-max problem to approximate the hypergradient using gradient information only. To reduce the traffic load in the inner loop that solves the lower-level problem, $\text{C}^2$DFB incorporates a lightweight communication protocol that transmits compressed residuals of local parameters. Rigorous theoretical analysis ensures its convergence, with a first-order oracle complexity of $\tilde{\mathcal{O}}(\epsilon^{-4})$. Experiments on hyperparameter tuning and hyper-representation tasks validate the superiority of $\text{C}^2$DFB across various topologies and heterogeneous data distributions.
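
The idea of transmitting compressed residuals of local parameters can be illustrated with a minimal sketch. The abstract does not specify the compressor or the bookkeeping, so everything here is an assumption made for illustration: a top-k compressor, a sender-side estimate of what the neighbours currently hold, and the hypothetical names `top_k`, `ResidualSender` and `residual_norm`. It is not the protocol of $\text{C}^2$DFB itself.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest (assumed compressor)."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


class ResidualSender:
    """Hypothetical helper: tracks the estimate of the local parameters held by neighbours."""

    def __init__(self, dim):
        self.estimate = [0.0] * dim  # what neighbours currently believe

    def message(self, params, k):
        # Compress the gap between the true parameters and the shared estimate,
        # then advance the estimate by exactly what is sent.
        residual = [p - e for p, e in zip(params, self.estimate)]
        msg = top_k(residual, k)
        self.estimate = [e + m for e, m in zip(self.estimate, msg)]
        return msg


def residual_norm(params, estimate):
    """Euclidean distance between the true parameters and an estimate of them."""
    return sum((p - e) ** 2 for p, e in zip(params, estimate)) ** 0.5


# Toy run: parameters are held fixed, so the neighbour's copy catches up over rounds.
params = [4.0, -0.5, 2.0, 0.25, -3.0, 1.0]
sender = ResidualSender(len(params))
neighbour_copy = [0.0] * len(params)
errors = []
for _ in range(3):
    msg = sender.message(params, k=2)  # at most 2 nonzero entries per round
    neighbour_copy = [c + m for c, m in zip(neighbour_copy, msg)]
    errors.append(residual_norm(params, neighbour_copy))
```

In this toy run each message carries at most two nonzero entries instead of all six, and the neighbour's copy approaches the true parameters as rounds accumulate. In an actual inner loop the parameters would also change between rounds, so the residual would not shrink to zero this cleanly.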

Paper Structure

This paper contains 31 sections, 17 theorems, 71 equations, 6 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Under Assumption assump_smooth, if $\lambda \geq 2L_f/\mu$, it holds that

Figures (6)

  • Figure 1: Communication protocol of inner loop for $\text{C}^2$DFB
  • Figure 2: Comparison of upper-level test accuracy versus communication load and training time for $\text{C}^2$DFB, MADSBO and MDBO under three topologies on the Coefficient Tuning task. The 'h' notation represents a heterogeneous data distribution across 10 clients, with the heterogeneity level set to 0.8 in the experiment.
  • Figure 3: Comparison of upper-level test loss versus communication load for $\text{C}^2$DFB, MADSBO and a naive compression version of $\text{C}^2$DFB under three topologies on the Hyper Representation task. The 'h' notation represents a heterogeneous data distribution across 10 clients, with the heterogeneity level set to 0.8 in the experiment.
  • Figure 4: Comparison of test loss versus communication rounds for $\text{C}^2$DFB, MADSBO and MDBO under three topologies on the Coefficient Tuning task. The 'h' notation represents a heterogeneous data distribution across 10 clients, with the heterogeneity level set to 0.8 in the experiment.
  • Figure 5: Sensitivity studies of $\text{C}^2$DFB: (1) varying the number of inner loops $K$ (left), (2) varying the compression ratio (middle), and (3) varying the multiplier $\sigma$ (right).
  • ...and 1 more figure

Theorems & Definitions (29)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Lemma 1 (pmlr-v202-kwon23c; chen2023nearoptimalnonconvexstronglyconvexbileveloptimization)
  • Lemma 2
  • Theorem 1
  • Corollary 1
  • Lemma 3
  • Theorem 2
  • ...and 19 more