Understanding the Generalization of Bilevel Programming in Hyperparameter Optimization: A Tale of Bias-Variance Decomposition

Yubo Zhou, Jun Shu, Junmin Liu, Deyu Meng

TL;DR

We decompose the hypergradient estimation error into bias and variance and give a detailed analysis of the variance term ignored by previous works, which helps explain phenomena commonly observed in practice, such as overfitting to the validation set.

Abstract

Gradient-based hyperparameter optimization (HPO) methods have emerged recently, leveraging bilevel programming techniques to optimize hyperparameters by estimating the hypergradient w.r.t. the validation loss. Nevertheless, previous theoretical works mainly focus on reducing the gap between the estimate and the ground truth (i.e., the bias), while ignoring the error due to the data distribution (i.e., the variance), which degrades performance. To address this issue, we conduct a bias-variance decomposition of the hypergradient estimation error and provide a detailed supplementary analysis of the variance term ignored by previous works. We also present a comprehensive analysis of the error bounds for hypergradient estimation. This makes it easy to explain some phenomena commonly observed in practice, such as overfitting to the validation set. Inspired by the derived theory, we propose an ensemble hypergradient strategy that effectively reduces the variance in HPO algorithms. Experimental results on tasks including regularization hyperparameter learning, data hyper-cleaning, and few-shot learning demonstrate that our variance reduction strategy improves hypergradient estimation. To explain the improved performance, we establish a connection between excess error and hypergradient estimation, offering some understanding of the empirical observations.
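
For orientation, the standard identity behind a bias-variance decomposition can be written as follows. The notation is assumed here for illustration and is not the paper's exact statement: $\hat{g}$ stands for a hypergradient estimate computed from randomly sampled data, $g$ for the fixed ground-truth hypergradient, and the expectation is taken over the data.

$$
\mathbb{E}\big[\Vert \hat{g}-g\Vert_2^2\big]
=\underbrace{\Vert \mathbb{E}[\hat{g}]-g\Vert_2^2}_{\text{squared bias}}
+\underbrace{\mathbb{E}\big[\Vert \hat{g}-\mathbb{E}[\hat{g}]\Vert_2^2\big]}_{\text{variance}}
$$

The cross term vanishes because $\mathbb{E}\big[\hat{g}-\mathbb{E}[\hat{g}]\big]=0$. In the idealized case where $U$ estimates $\hat{g}_1,\dots,\hat{g}_U$ are independent and identically distributed, their average has the same bias and $1/U$ of the variance. Estimates built from different splittings of one dataset are generally not independent, so this rate should be read as a best case rather than as a guarantee.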

Paper Structure

This paper contains 55 sections, 24 theorems, 114 equations, 13 figures, 11 tables, 4 algorithms.

Key Result

Proposition 1

$\widehat{\nabla}f(\boldsymbol{\lambda})$ takes the analytical form $\widehat{\nabla}f(\boldsymbol{\lambda})=\nabla_{\boldsymbol{\lambda}}\hat{\mathcal{R}}^{val}(\boldsymbol{\lambda},\boldsymbol{\theta}_K)-\alpha_{in}\sum_{k=0}^{K-1}\nabla^2_{\boldsymbol{\lambda}\boldsymbol{\theta}}\cdots$

Figures (13)

  • Figure 1: Illustration of the impact of variance on hypergradient estimation across multiple data splittings. In this setting, we fit an elastic net on 5-dimensional data, i.e., $\min_{\boldsymbol{\theta}}\{\sum_{i=1}^N(y_i-x_i^T\boldsymbol{\theta})^2+\lambda_1\Vert\boldsymbol{\theta}\Vert_1+\lambda_2\Vert\boldsymbol{\theta}\Vert_2^2\}$, where the hyperparameters $\lambda_1$ and $\lambda_2$ are the regularization coefficients of the L1 and L2 norms. For the RHG and RHG+EHG methods (EHG is introduced in Section \ref{sec4-1}), we repeat the experiments 100 times with different random seeds; RHG (Franceschi et al., 2017) is a classic ITD method. $U$ denotes the number of splittings for EHG. For details, please see Appendix \ref{sectionA16}.
  • Figure 2: Overview of the OEHG algorithm. At iteration $t$, the inner level first updates the model parameters of the $U$ data splittings by Eq. (\ref{equation53-1}). The outer level then updates the hyperparameter by Eq. (\ref{equation53-2}). The updated hyperparameter is further used to update the model parameters by Eq. (\ref{equation53-3}).
  • Figure 3: (a): Visualization of hypergradient error, bias, and variance. error_theory is calculated from the generated data distribution. (b-f): Visualization of the hypergradient in HPO. The inner sub-problem is solved via the closed-form solution of ridge regression.
  • Figure 4: Error ratio (RHG+EHG test error for $U$=5 / RHG test error for $U$=1), where $U$ denotes the number of splittings. The red circle marks where the two errors are equal. Points inside the red circle indicate that the test error of RHG+EHG is smaller than that of RHG, and vice versa. The four panels correspond to different models: lasso regression, ridge regression, logistic regression, and support vector machine.
  • Figure 5: Left: Illustration of the HPO process of AID and AID+EHG ($U=16$). The AID variant used is AID-FP. The curves and shaded regions represent the mean and standard deviation calculated from 10 repeated experiments. Right: Illustration of the HPO process of EHG under AID. The GroundTruth curve is calculated from the analytical solution of the lower-level problem.
  • ...and 8 more figures
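
The ridge-regression setting of Figure 3 and the role of the number of splittings $U$ in Figure 1 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single scalar hyperparameter `lam`, a mean-squared validation loss, and an ensemble that simply averages hypergradients over `num_splits` random train/validation splittings. The function names `ridge_hypergradient` and `ensemble_hypergradient` are hypothetical.

```python
import numpy as np


def ridge_hypergradient(X_tr, y_tr, X_val, y_val, lam):
    """Exact d(validation MSE)/d(lam) for ridge regression.

    Inner problem (assumed form): min_theta ||X_tr theta - y_tr||^2 + lam ||theta||^2,
    whose closed-form solution is theta = (X_tr^T X_tr + lam I)^{-1} X_tr^T y_tr.
    """
    d = X_tr.shape[1]
    A = X_tr.T @ X_tr + lam * np.eye(d)
    theta = np.linalg.solve(A, X_tr.T @ y_tr)
    # Differentiating A(lam) theta = X_tr^T y_tr in lam gives d theta/d lam = -A^{-1} theta.
    dtheta_dlam = -np.linalg.solve(A, theta)
    # Gradient of the mean-squared validation loss with respect to theta.
    grad_theta = 2.0 / len(y_val) * X_val.T @ (X_val @ theta - y_val)
    return float(grad_theta @ dtheta_dlam)


def ensemble_hypergradient(X, y, lam, num_splits, rng, val_fraction=0.5):
    """Average the hypergradient over `num_splits` random train/validation splittings.

    Plain averaging is an assumption made for this sketch.
    """
    n = len(y)
    n_val = int(n * val_fraction)
    grads = []
    for _ in range(num_splits):
        perm = rng.permutation(n)
        val_idx, tr_idx = perm[:n_val], perm[n_val:]
        grads.append(
            ridge_hypergradient(X[tr_idx], y[tr_idx], X[val_idx], y[val_idx], lam)
        )
    return float(np.mean(grads))
```

Calling `ensemble_hypergradient(X, y, lam, num_splits, np.random.default_rng(seed))` repeatedly with `num_splits=1` and then with a larger value gives a rough view of how the spread of the estimate across splittings changes with the ensemble size.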

Theorems & Definitions (27)

  • Proposition 1
  • Theorem 5
  • Theorem 6
  • Lemma 7
  • Theorem 8
  • Lemma 9
  • Theorem 10
  • Lemma 11
  • Theorem 12
  • Lemma 13
  • ...and 17 more