Provably Convergent Federated Trilevel Learning

Yang Jiao, Kai Yang, Tiancheng Wu, Chengtao Jian, Jianwei Huang

TL;DR

The paper tackles privacy-preserving trilevel optimization (TLO) by introducing AFTO, an asynchronous federated trilevel optimization framework that builds a hyper-polyhedral approximation from $\mu$-cuts to handle $\mu$-weakly convex objectives. It establishes a non-asymptotic guarantee: the iteration complexity to reach an $\epsilon$-stationary point is $T(\epsilon)=\mathcal{O}\left(\frac{1}{\epsilon^2}\right)$. Empirically, it reports convergence speedups of up to about 80% on real-world tasks. By reformulating the TLO problem as a consensus problem and layering two sets of cutting planes, the approach supports distributed optimization with stragglers under bounded staleness, and it refines the approximation by iteratively adding $\mu$-cuts and pruning inactive planes. The method is evaluated on distributed robust hyperparameter optimization and domain adaptation, where it converges faster and performs better than state-of-the-art baselines. Overall, the work offers a scalable, privacy-preserving approach to TLO with explicit convergence guarantees.

Abstract

Trilevel learning, also called trilevel optimization (TLO), has been recognized as a powerful modelling tool for hierarchical decision processes and is widely applied in many machine learning applications, such as robust neural architecture search, hyperparameter optimization, and domain adaptation. Tackling TLO problems is a great challenge due to their nested decision-making structure. In addition, existing works on TLO face the following key challenges: 1) they all focus on the non-distributed setting, which may lead to privacy breaches; 2) they do not offer any non-asymptotic convergence analysis, which characterizes how fast an algorithm converges. To address these challenges, this paper proposes an asynchronous federated trilevel optimization method to solve TLO problems. The proposed method utilizes $\mu$-cuts to construct a hyper-polyhedral approximation for the TLO problem and solves it in an asynchronous manner. We demonstrate that the proposed $\mu$-cuts are applicable not only to convex functions but also to a wide range of non-convex functions that meet the $\mu$-weakly convex assumption. Furthermore, we theoretically analyze the non-asymptotic convergence rate of the proposed method by showing that its iteration complexity to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\frac{1}{\epsilon^2})$. Extensive experiments on real-world datasets have been conducted to elucidate the superiority of the proposed method, e.g., it has a faster convergence rate with a maximum acceleration of approximately 80%.
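To make the idea of a cutting plane for a weakly convex constraint concrete, here is a minimal Python sketch. It is not the paper's exact $\mu$-cut construction; it only uses the standard inequality for a differentiable $\mu$-weakly convex function, $h(y) \ge h(x) + \nabla h(x)^{\top}(y-x) - \frac{\mu}{2}\|y-x\|^2$, together with an assumed bound $\|y\|^2 \le \alpha$ on the feasible points, to obtain a linear inequality that every feasible point satisfies. The test function `h`, the bound `alpha`, and all helper names are assumptions chosen for illustration.

```python
import math
import random

MU = 1.0  # h below is 1-weakly convex: its Hessian is diag(-sin(y_i)) >= -I


def h(y):
    """Illustrative non-convex constraint function (an assumption, not from the paper)."""
    return sum(math.sin(v) for v in y)


def grad_h(y):
    return [math.cos(v) for v in y]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def weakly_convex_cut(x, eps, alpha, mu=MU):
    """Return (a, b) with a.y <= b for every y such that h(y) <= eps and ||y||^2 <= alpha.

    Derivation: expand -(mu/2)||y - x||^2 in the weak-convexity inequality and
    lower-bound the term -(mu/2)||y||^2 by -(mu/2)*alpha.
    """
    g = grad_h(x)
    a = [gi + mu * xi for gi, xi in zip(g, x)]
    b = eps - h(x) + dot(g, x) + 0.5 * mu * (dot(x, x) + alpha)
    return a, b


def count_violations(num_cuts=20, num_points=2000, eps=0.0, alpha=4.0, dim=2, seed=0):
    """Sample feasible points and count how many generated cuts they violate."""
    rng = random.Random(seed)
    radius = math.sqrt(alpha / dim)  # this box lies inside the ball ||y||^2 <= alpha

    def sample():
        return [rng.uniform(-radius, radius) for _ in range(dim)]

    cuts = [weakly_convex_cut(sample(), eps, alpha) for _ in range(num_cuts)]
    feasible = 0
    violations = 0
    for _ in range(num_points):
        y = sample()
        if h(y) <= eps:  # y is feasible for the constraint h(y) <= eps
            feasible += 1
            violations += sum(dot(a, y) > b + 1e-9 for a, b in cuts)
    return feasible, violations
```

Calling `count_violations()` returns the number of sampled feasible points and the number of violated cuts; by the derivation in the docstring the second value should be zero, which is the outer-approximation property a polytope built from such cuts relies on. The cuts can be loose, since the bound depends on `alpha`.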

Paper Structure

This paper contains 18 sections, 4 theorems, 32 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

The feasible region of the constraint $h_{\rm I}(\{\boldsymbol{x}_{3,j}\}, \boldsymbol{z}_1, \boldsymbol{z}_2', \boldsymbol{z}_3) \le \varepsilon_{\rm I}$ is a subset of the ${\rm I}^{\rm st}$-layer polytope $P_{\rm I}^{t} = \{ {\boldsymbol{a}_{1,l}^{\rm I}}^{\top} \boldsymbol{z}_1 \dots$

Figures (2)

  • Figure 1: MSE of clean test data and test data with Gaussian noise on (a) Diabetes, (b) Boston, (c) Red-wine quality, and (d) White-wine quality datasets. All experiments are repeated five times, and the shaded areas represent the standard deviation.
  • Figure 2: (a) Test accuracy and (b) test loss vs running time when SVHN is utilized to pretrain the model. (c) Test accuracy and (d) test loss vs running time when MNIST is utilized to pretrain the model. All experiments are repeated five times.

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Proposition 1
  • Proposition 2
  • Definition 3
  • Definition 4
  • Theorem 1
  • Theorem 2