Unlocking TriLevel Learning with Level-Wise Zeroth Order Constraints: Distributed Algorithms and Provable Non-Asymptotic Convergence
Yang Jiao, Kai Yang, Chengtao Jian
TL;DR
This work tackles distributed trilevel optimization with level-wise zeroth-order constraints, addressing the absence of gradient information in practical, privacy-preserving settings. It introduces DTZO, a gradient-free framework that builds cascaded zeroth-order polynomial approximations through zeroth-order cuts and a consensus-based distributed algorithm, accompanied by non-asymptotic convergence guarantees to an $ε$-stationary point. Theoretical results quantify iteration and communication complexities and reveal a tunable trade-off via a cascade-refinement horizon parameter $T_1$. Empirically, DTZO demonstrates superior performance on black-box trilevel learning with LLMs and on robust hyperparameter optimization tasks, validating effectiveness, scalability, and robustness to smoothing choices.
Abstract
Trilevel learning (TLL) has found diverse applications in machine learning, ranging from robust hyperparameter optimization to domain adaptation. However, existing research primarily focuses on scenarios where TLL can be addressed with first-order information available at each level, which is inadequate in many situations involving zeroth-order constraints, such as when black-box models are employed. Moreover, in trilevel learning, data may be distributed across various nodes, necessitating strategies that address TLL problems without centralizing data on servers, so as to uphold data privacy. To this end, this work proposes DTZO, an effective distributed trilevel zeroth-order learning framework that addresses TLL problems with level-wise zeroth-order constraints in a distributed manner. DTZO is versatile and can be adapted to a wide range of (grey-box) TLL problems with partial zeroth-order constraints. In DTZO, the cascaded polynomial approximation is constructed without relying on gradients or sub-gradients by leveraging a novel cut, the zeroth-order cut. Furthermore, we theoretically analyze the non-asymptotic convergence rate of DTZO in reaching an $ε$-stationary point. Extensive experiments validate the superior performance of DTZO, which achieves an improvement of up to approximately 40% in performance.
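To make the notion of a zeroth-order (gradient-free) setting concrete, the following is a minimal sketch of a standard two-point Gaussian-smoothing gradient estimator, which approximates a gradient using only function evaluations. It is an illustration of the general setting, not the zeroth-order cut or the DTZO algorithm itself; the names `zo_gradient` and `black_box`, the smoothing parameter `mu`, and the sample count are all assumptions chosen for the example.

```python
import random


def zo_gradient(f, x, mu=1e-4, num_samples=4000, seed=0):
    """Two-point Gaussian-smoothing estimate of the gradient of f at x.

    Only function values of f are queried; no derivative of f is used.
    mu is the smoothing parameter (an illustrative choice here).
    """
    rng = random.Random(seed)
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_samples):
        # Random Gaussian direction u ~ N(0, I).
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_pert = [xi + mu * ui for xi, ui in zip(x, u)]
        # Finite difference along u, then average the scaled directions.
        scale = (f(x_pert) - fx) / mu
        for i in range(d):
            grad[i] += scale * u[i] / num_samples
    return grad


def black_box(x):
    # Hypothetical objective standing in for a black-box model's loss.
    return sum((xi - 1.0) ** 2 for xi in x)


# The true gradient at this point is [-2.0, 2.0, 4.0]; the estimate is noisy
# and only approaches it as num_samples grows.
estimate = zo_gradient(black_box, [0.0, 2.0, 3.0])
print(estimate)
```

The estimate is a Monte Carlo average, so its accuracy depends on the number of sampled directions and on the smoothing parameter; this is the kind of "smoothing choice" that a zeroth-order method has to be robust to.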
