On Penalty-based Bilevel Gradient Descent Method
Han Shen, Quan Xiao, Tianyi Chen
TL;DR
This paper tackles bilevel optimization through a penalty-based reformulation that couples the upper- and lower-level problems via a penalty term. For a $\rho$-squared-distance-bound penalty $p(x,y)$, the authors derive conditions under which solving the penalized problem $\mathcal{BP}_{\gamma p}$ yields $\epsilon$-approximate or exact global/local solutions of the original bilevel problem, even without lower-level strong convexity. They propose the penalty-based bilevel gradient descent (PBGD) algorithm and its stochastic variants, prove finite-time convergence when the lower-level problem is unconstrained, and extend the framework to lower-level constraints and nonsmooth penalties. Experiments on synthetic and real data show that the approach is efficient and scalable relative to competitive baselines, which suggests it may be useful in hyperparameter optimization, imaging, meta-learning, and adversarial settings where bilevel formulations arise.
Abstract
Bilevel optimization enjoys a wide range of applications in emerging machine learning and signal processing problems such as hyperparameter optimization, image reconstruction, meta-learning, adversarial training, and reinforcement learning. However, bilevel optimization problems are traditionally known to be difficult to solve. Recent progress on bilevel algorithms mainly relies on the implicit-gradient method and focuses on problems where the lower-level objective is either strongly convex or unconstrained. In this work, we tackle a challenging class of bilevel problems through the lens of the penalty method. We show that under certain conditions, the penalty reformulation recovers the (local) solutions of the original bilevel problem. Further, we propose the penalty-based bilevel gradient descent (PBGD) algorithm and establish its finite-time convergence for bilevel problems with lower-level constraints and without lower-level strong convexity. Experiments on synthetic and real datasets showcase the efficiency of the proposed PBGD algorithm.
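To make the penalty idea concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. It assumes the value-function penalty $p(x,y) = g(x,y) - \min_{y'} g(x,y')$, which is one possible choice and is not stated in the text above. The toy objectives, the function names (`grad_f`, `grad_g`, `pbgd_sketch`), and all step sizes are hypothetical and chosen only for illustration.

```python
# Toy bilevel problem (hypothetical, for illustration only):
#   upper level: f(x, y) = 0.5*(x - 1)^2 + 0.5*(y - 2)^2
#   lower level: g(x, y) = 0.5*(y - x)^2, so argmin_y g(x, y) = x
# Substituting y = x into f gives the bilevel solution x = y = 1.5.

def grad_f(x, y):
    """Gradient of the upper-level objective with respect to (x, y)."""
    return x - 1.0, y - 2.0


def grad_g(x, y):
    """Gradient of the lower-level objective with respect to (x, y)."""
    return x - y, y - x


def pbgd_sketch(gamma=50.0, outer_steps=3000, inner_steps=20, beta=0.5):
    """Gradient descent on f + gamma * p with an assumed value-function penalty.

    p(x, y) = g(x, y) - min_w g(x, w). Its x-gradient is approximated by
    grad_x g(x, y) - grad_x g(x, w), where w is an inexact lower-level
    minimizer obtained from a few inner gradient steps.
    """
    # Step size 1/L, where L = 1 + 2*gamma is the smoothness constant of
    # the penalized objective for this particular quadratic toy problem.
    alpha = 1.0 / (1.0 + 2.0 * gamma)
    x, y = 0.0, 0.0
    for _outer in range(outer_steps):
        # Inner loop: approximate argmin_w g(x, w), warm-started at y.
        w = y
        for _inner in range(inner_steps):
            w -= beta * grad_g(x, w)[1]
        fx, fy = grad_f(x, y)
        gx, gy = grad_g(x, y)
        gx_w, _ = grad_g(x, w)
        # Gradient of the penalized objective f + gamma * p.
        dx = fx + gamma * (gx - gx_w)
        dy = fy + gamma * gy
        x -= alpha * dx
        y -= alpha * dy
    return x, y


if __name__ == "__main__":
    for gamma in (10.0, 50.0):
        x, y = pbgd_sketch(gamma=gamma)
        print(f"gamma={gamma}: x={x:.4f}, y={y:.4f}, gap={y - x:.4f}")
```

On this toy problem the penalized stationary point satisfies $y - x \approx 1/(1+2\gamma)$, so a larger $\gamma$ brings the iterate closer to the bilevel solution at the cost of a smaller admissible step size. This mirrors, in a very simplified setting, the $\epsilon$-approximate recovery described above.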
