Perturbed Gradient Descent via Convex Quadratic Approximation for Nonconvex Bilevel Optimization

Nazanin Abolfazli, Sina Sharifi, Mahyar Fazlyab, Erfan Yazdandoost Hamedani

TL;DR

This paper introduces a discretized variant of RXGF, formulates convex quadratic program subproblems with closed-form solutions, and provides a rigorous convergence analysis. Under the existence of a KKT point and a regularity assumption (the lower-level gradient PL assumption), the method achieves an iteration complexity of $\mathcal{O}(1/\epsilon^{1.5})$ in terms of the squared norm of the KKT residual for the reformulated problem.

Abstract

Bilevel optimization is a fundamental tool in hierarchical decision-making and has been widely applied to machine learning tasks such as hyperparameter tuning, meta-learning, and continual learning. While significant progress has been made in bilevel optimization, existing methods predominantly focus on the nonconvex-strongly convex or the nonconvex-PL settings, leaving the more general nonconvex-nonconvex framework underexplored. In this paper, we address this gap by developing an efficient gradient-based method inspired by the recently proposed Relaxed Gradient Flow (RXGF) framework with a continuous-time dynamic. In particular, we introduce a discretized variant of RXGF and formulate convex quadratic program subproblems with closed-form solutions. We provide a rigorous convergence analysis, demonstrating that under the existence of a KKT point and a regularity assumption (the lower-level gradient PL assumption), our method achieves an iteration complexity of $\mathcal{O}(1/\epsilon^{1.5})$ in terms of the squared norm of the KKT residual for the reformulated problem. Moreover, even in the absence of the regularity assumption, we establish an iteration complexity of $\mathcal{O}(1/\epsilon^{3})$ for the same metric. Through extensive numerical experiments on convex and nonconvex synthetic benchmarks and a hyper-data cleaning task, we illustrate the efficiency and scalability of our approach.
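
For orientation, the setting described in the abstract can be sketched in generic notation. This is an illustrative sketch, not the paper's exact statement: the lower-level objective $g$, the specific choice $h(x,y)=\tfrac12\|\nabla_y g(x,y)\|^2$, and the multiplier form of the KKT residual are assumptions made here for illustration; only $f$, $h$, and $\lambda$ appear as symbols in this summary.

```latex
% Generic bilevel problem (g is an assumed name for the lower-level objective)
\min_{x,\,y}\; f(x,y)
\quad \text{s.t.} \quad
y \in \arg\min_{y'} \; g(x,y')

% One common gradient-based reformulation (assumed form of h):
% replace lower-level optimality by lower-level stationarity
\min_{x,\,y}\; f(x,y)
\quad \text{s.t.} \quad
h(x,y) := \tfrac{1}{2}\,\|\nabla_y g(x,y)\|^2 \le 0

% A typical squared KKT residual for such a reformulation (assumed form)
\|\nabla f(x,y) + \lambda \nabla h(x,y)\|^2 + h(x,y)^2,
\qquad \lambda \ge 0
```

Because $h \ge 0$ under this assumed form, the constraint $h \le 0$ is equivalent to $\nabla_y g(x,y) = 0$, which is a necessary (and in the nonconvex case not sufficient) condition for lower-level optimality.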

Paper Structure

This paper contains 28 sections, 6 theorems, 52 equations, 5 figures, 1 table, 1 algorithm.

Key Result

Theorem 4.1

Suppose that the upper-level and lower-level assumptions hold and $\rho=\|\nabla h\|^2$. Let $\{(x_k,y_k,\lambda_k)\}_{k=0}^{K-1}$ be the sequence generated by the algorithm with $C_0>0$ and step size $\gamma>0$ such that $\gamma\leq \min\{\alpha,\frac{1}{L_f+\alpha L_h}\}$. Then for all $K$ ...
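
The step-size condition in the theorem can be checked numerically with a small helper. This is a minimal sketch: the function name `max_step_size` is hypothetical, and reading $L_f$ and $L_h$ as smoothness (gradient Lipschitz) constants of $f$ and $h$ is an assumption, since this summary does not define them.

```python
def max_step_size(alpha: float, L_f: float, L_h: float) -> float:
    """Largest gamma allowed by gamma <= min{alpha, 1 / (L_f + alpha * L_h)}.

    alpha, L_f, L_h are assumed to be positive constants; L_f and L_h are
    interpreted here (an assumption) as smoothness constants of f and h.
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    denom = L_f + alpha * L_h
    if denom <= 0.0:
        raise ValueError("L_f + alpha * L_h must be positive")
    return min(alpha, 1.0 / denom)


if __name__ == "__main__":
    # Illustrative constants only; they are not taken from the paper.
    print(max_step_size(alpha=0.5, L_f=2.0, L_h=4.0))
```

With a large `alpha` the second term of the minimum dominates, so the admissible step size shrinks as the constraint function becomes less smooth.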

Figures (5)

  • Figure 1: Overview of the proposed method. The QP takes the gradient directions, perturbs them according to the lower-level problem, and then the variables are updated using the new directions.
  • Figure 2: Effect of the number of iterations on convergence for the strongly convex synthetic example, and comparison with BOME [liu2022bome] on the nonconvex synthetic benchmark. (leftmost) Parameter choice from \ref{thm:conv_rate} and (mid-left) from \ref{thm:conv_rate2}, both on the synthetic example with a strongly convex lower-level function. (mid-right and rightmost) Comparison with BOME [liu2022bome] on the synthetic example with a nonconvex lower-level function.
  • Figure 3: Comparison of our method with the state of the art on the DHC benchmark with corruption rate $p=25\%$. The first two plots show the validation loss and the test accuracy on the DHC benchmark with PCA. The last two plots show the validation loss and the test accuracy on the large-scale DHC problem.
  • Figure 4: Comparison of the validation loss and test accuracy of our method (parameter choice from \ref{thm:conv_rate}) with BOME and VPBGD on the DHC problem with a neural network classifier.
  • Figure 5: Comparison of our method, AIDBiO, and BOME on the coreset selection problem.
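
The mechanism summarized in the Figure 1 caption (a QP perturbs the gradient direction according to the lower-level problem, then the variables are updated) can be illustrated with a generic closed-form QP. This is a sketch under stated assumptions, not the paper's exact subproblem: it assumes a single linear constraint of the form $\nabla h^\top d \le -\alpha h$, for which the QP reduces to a projection onto a half-space. The function names and the constraint right-hand side are hypothetical.

```python
from typing import List, Sequence


def project_onto_halfspace(d0: Sequence[float], a: Sequence[float], b: float) -> List[float]:
    """Closed-form solution of  min_d 0.5 * ||d - d0||^2  s.t.  a^T d <= b."""
    a_dot_d0 = sum(ai * di for ai, di in zip(a, d0))
    a_sq = sum(ai * ai for ai in a)
    if a_sq == 0.0:
        # Degenerate case: the constraint does not depend on d, so keep d0.
        return list(d0)
    # Multiplier is zero when d0 is already feasible, positive otherwise.
    lam = max(0.0, (a_dot_d0 - b) / a_sq)
    return [di - lam * ai for di, ai in zip(d0, a)]


def perturbed_step(z: Sequence[float], grad_f: Sequence[float], grad_h: Sequence[float],
                   h_val: float, alpha: float, gamma: float) -> List[float]:
    """One illustrative update on the stacked variable z = (x, y).

    Assumed QP (hypothetical form): stay close to -grad_f while enforcing
    grad_h^T d <= -alpha * h_val, then move z along the resulting direction.
    """
    d0 = [-g for g in grad_f]
    d = project_onto_halfspace(d0, grad_h, -alpha * h_val)
    return [zi + gamma * di for zi, di in zip(z, d)]


if __name__ == "__main__":
    # Toy numbers chosen for illustration only.
    z_next = perturbed_step(z=[1.0, 2.0], grad_f=[1.0, 0.0], grad_h=[0.0, 2.0],
                            h_val=2.0, alpha=0.5, gamma=0.1)
    print(z_next)  # approximately [0.9, 1.95]
```

In the toy call, the unperturbed direction `[-1, 0]` violates the assumed constraint, so the projection adds a component along `-grad_h`, giving the direction `[-1, -0.5]` before the step of size `gamma` is taken.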

Theorems & Definitions (20)

  • Definition 2.1
  • Remark 3.1: Computing $\nabla_x h$ and $\nabla_y h$
  • Theorem 4.1
  • proof
  • Corollary 4.2
  • proof
  • Corollary 4.3
  • proof
  • Remark 4.1
  • Theorem 4.4
  • ...and 10 more