
Enhancing Hypergradients Estimation: A Study of Preconditioning and Reparameterization

Zhenzhang Ye, Gabriel Peyré, Daniel Cremers, Pierre Ablin

TL;DR

The paper tackles hypergradient estimation in bilevel optimization, where the outer gradient is computed via the Implicit Function Theorem (IFT) and is sensitive to errors in the inner solution. It studies two consistent strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. It derives how these strategies affect the hypergradient through the Jacobian $\tilde{\Omega}_1$ and the efficiency constant $C_y$. A key notion is super-efficiency, where the hypergradient error depends quadratically on the inner-resolution error. This is achievable under specific structures (e.g., Newton-like preconditioning with $P=F_1$ or certain affine outer objectives) and in localized/separable reformulations under favorable conditions. The theoretical analysis is complemented by numerical experiments on ridge and logistic regression. These illustrate that preconditioning generally outperforms reparameterization, and that carefully designed reparameterizations can compensate for poor Hessian approximations, which gives practical guidance for scalable bilevel learning tasks.

Abstract

Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called hypergradient of the outer problem is to use the Implicit Function Theorem (IFT). As a function of the error of the inner problem resolution, we study the error of the IFT method. We analyze two strategies to reduce this error: preconditioning the IFT formula and reparameterizing the inner problem. We give a detailed account of the impact of these two modifications on the error, highlighting the role played by higher-order derivatives of the functionals at stake. Our theoretical findings explain when super efficiency, namely reaching an error on the hypergradient that depends quadratically on the error on the inner problem, is achievable and compare the two approaches when this is impossible. Numerical evaluations on hyperparameter tuning for regression problems substantiate our theoretical findings.
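
To make the dependence on the inner error concrete, here is a minimal NumPy sketch of an IFT hypergradient on a toy ridge regression. It is an illustration under stated assumptions, not the paper's experimental setup: the inner problem is taken to be the root of $F(x, y) = A^\top(Ax - b) + yx$ with a scalar regularization weight $y$, the outer loss $g$ is a validation least-squares loss, and the names `inner_solution`, `ift_hypergradient`, `A_val`, `b_val`, and the symbol `F2` (derivative of $F$ with respect to $y$) are hypothetical choices made for this example.

```python
import numpy as np


def inner_solution(A, b, y):
    """Exact ridge solution x*(y), the root of F(x, y) = A^T (A x - b) + y x."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + y * np.eye(d), A.T @ b)


def ift_hypergradient(x_hat, y, A, A_val, b_val):
    """IFT estimate of dh/dy evaluated at an approximate inner solution x_hat."""
    d = A.shape[1]
    F1 = A.T @ A + y * np.eye(d)                # dF/dx (inner Hessian, symmetric)
    F2 = x_hat                                  # dF/dy for a scalar y (assumed setup)
    grad_g = A_val.T @ (A_val @ x_hat - b_val)  # gradient of the outer loss g
    return -F2 @ np.linalg.solve(F1, grad_g)


# Synthetic data; sizes and seed are arbitrary illustration choices.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 5)), rng.standard_normal(30)
A_val, b_val = rng.standard_normal((20, 5)), rng.standard_normal(20)
y = 0.5

x_star = inner_solution(A, b, y)
exact = ift_hypergradient(x_star, y, A, A_val, b_val)

# Perturb the exact inner solution along a fixed unit direction and
# record how far the IFT estimate moves from its value at x_star.
direction = rng.standard_normal(5)
direction /= np.linalg.norm(direction)
errors = {}
for eps in (1e-2, 1e-4, 1e-6):
    x_hat = x_star + eps * direction
    errors[eps] = abs(ift_hypergradient(x_hat, y, A, A_val, b_val) - exact)
    print(f"inner error {eps:.0e} -> hypergradient error {errors[eps]:.2e}")
```

In this toy setting the estimate is exact when `x_hat` equals `x_star`, and the printed errors show how the hypergradient error shrinks as the inner error shrinks. For the plain IFT formula one would generally expect that dependence to be linear; super efficiency corresponds to a quadratic dependence.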

Paper Structure (26 sections, 30 theorems, 66 equations, 3 figures, 1 table)


Key Result

Proposition 1

If $\tilde{\Omega}$ is $C^1$ and consistent, then for all $\hat{x}$ and $y$

Figures (3)

  • Figure 1: Comparison of $P^{\text{Newton}}$ from Prop. \ref{prop:ideal_precond} and ${\psi}^{\text{opt}}_{x, \bar{y}}$ from Prop. \ref{prop:ideal_sep_rep} on ridge regression with different outer problems. We show the efficiency constant $C_y$ in log space for different values of $y$. (a) When the outer problem is affine, both strategies achieve a small efficiency constant $C_y$, around machine accuracy. (b) When the outer problem is quadratic, the Newton preconditioner achieves super efficiency while ${\psi}^{\text{opt}}_{x, \bar{y}}$ has a large constant $C_y$.
  • Figure 2: Comparison of $P^{\text{diag}}$, $\psi^{\text{exp}}_{x}$, and $\psi^{\text{diag}}_{x}$ on ridge regression with the outer problem $g$. (a) Hypergradient errors of the different strategies in log space over the number of iterations when the preconditioner is poor. In this setting, $\psi^{\text{exp}}_{x}$ can be a better choice. (b) Efficiency constant $C_y$ of each strategy in log space for different values of $y$. Although reparameterization can perform better in some cases, $P^{\text{diag}}$ is in general the best choice.
  • Figure 3: Comparison of different strategies on hypergradient error in log space over the approximate root for logistic regression. $P^{\text{Newton}}$ always achieves super efficiency. (a) With a small $y$, the performances of $P^{\text{diag}}$ and $\psi^{\text{diag}}_{x}$ are nearly the same, and $\psi^{\text{exp}}_{x}$ performs worse than the vanilla one in some situations. (b) The performance of $P^{\text{diag}}$ improves thanks to the large $y$, which leads to a diagonally dominant $F_1$. The two reparameterizations perform similarly, and both are better than the vanilla one.

Theorems & Definitions (54)

  • Definition 1: Consistency
  • Proposition 1: Hypergradient approximation
  • Definition 2: Super efficiency [ablin2020super]
  • Proposition 2: Jacobian of estimation
  • Proposition 3: IFT efficiency
  • Proposition 4: Preconditioned estimation
  • proof
  • Proposition 5: Newton-like preconditioner
  • proof
  • Proposition 6
  • ...and 44 more