A Unified Gaussian Process for Branching and Nested Hyperparameter Optimization

Jiazhao Zhang, Ying Hung, Chung-Ching Lin, Zicheng Liu

TL;DR

The paper addresses conditional hyperparameter dependencies in deep learning by introducing a unified Gaussian process model that handles branching and nested tuning parameters. A three-type product kernel $R(\mathbf{x},\mathbf{x}') = R_{\boldsymbol{\theta}}(\mathbf{w},\mathbf{w}') R_{\boldsymbol{\gamma}}(\mathbf{z},\mathbf{z}') R_{\boldsymbol{\phi}}(\mathbf{v},\mathbf{v}')$ is proposed to couple continuous, branching, and nested inputs, with a nested-kernel $R_{\boldsymbol{\phi}_k}$ and a sufficient condition for positive definiteness $\min_b [ \exp(-\phi^b_{kj})+(1-\exp(-\phi^b_{kj}))/g^b_j ] \ge \exp(-\gamma_k)$. The framework provides convergence guarantees in an RKHS and an EI acquisition that accounts for conditional structure, achieving a simple-regret rate of $\mathcal{O}(L^{\nu/d}(n/\log n)^{-\nu/d}(\log n)^{\alpha})$ under a continuum-armed-bandit setting. Empirical results on synthetic functions and CIFAR-100/ResNet/MobileNet hyperparameter tuning demonstrate higher prediction accuracy and faster optimization than strong baselines, with informative sensitivity analyses revealing how hyperparameters interact to affect accuracy.
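For readers unfamiliar with the EI acquisition mentioned above, the following is a minimal sketch of the standard closed-form expected improvement for maximization under a Gaussian posterior. It is the textbook formula, not the paper's conditional-structure variant; the function name `expected_improvement` and its arguments are assumptions made for illustration.

```python
import math

def expected_improvement(mu, sigma, best):
    """Standard EI for maximization, given posterior mean mu, posterior
    standard deviation sigma, and the best value observed so far.
    Illustrative only: this ignores any branching or nested structure."""
    if sigma <= 0.0:
        # With no posterior uncertainty, improvement is deterministic.
        return max(mu - best, 0.0)
    u = (mu - best) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))         # standard normal CDF
    return (mu - best) * cdf + sigma * pdf
```

In a GP-based search, `mu` and `sigma` would come from the posterior at a candidate setting, and the candidate with the largest EI would be evaluated next.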

Abstract

Choosing appropriate hyperparameters plays a crucial role in the success of neural networks, as hyperparameters directly control the behavior and performance of the training algorithms. To obtain efficient tuning, Bayesian optimization methods based on Gaussian process (GP) models are widely used. Despite numerous applications of Bayesian optimization in deep learning, the existing methodologies are developed based on a convenient but restrictive assumption that the tuning parameters are independent of each other. However, tuning parameters with conditional dependence are common in practice. In this paper, we focus on two types of them: branching and nested parameters. Nested parameters refer to those tuning parameters that exist only within a particular setting of another tuning parameter, and a parameter within which other parameters are nested is called a branching parameter. To capture the conditional dependence between branching and nested parameters, a unified Bayesian optimization framework is proposed. Sufficient conditions are rigorously derived to guarantee the validity of the kernel function, and the asymptotic convergence of the proposed optimization framework is proven under the continuum-armed-bandit setting. Based on the new GP model, which accounts for the dependent structure among input variables through a new kernel function, higher prediction accuracy and better optimization efficiency are observed in a series of synthetic simulations and real data applications of neural networks. Sensitivity analysis is also performed to provide insights into how changes in hyperparameter values affect prediction accuracy.
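The distinction between branching and nested parameters can be illustrated with a small conditional search space. This is a hedged sketch: the parameter names, levels, and ranges below (`network`, `depth`, `width_mult`, `lr`, `momentum`) are hypothetical and are not the settings used in the paper. It only shows that a nested parameter is sampled when, and only when, its branching parameter takes the corresponding value.

```python
import random

# Hypothetical search space; all names and ranges are illustrative assumptions.
SHARED = {"lr": (1e-4, 1e-1), "momentum": (0.0, 0.99)}  # exist in every setting
BRANCHING = {
    "network": {                                     # branching parameter
        "resnet": {"depth": [18, 34, 50]},           # nested: exists only for resnet
        "mobilenet": {"width_mult": [0.5, 0.75, 1.0]},  # nested: only for mobilenet
    }
}

def sample_config(rng):
    """Draw one configuration that respects the conditional structure."""
    config = {name: rng.uniform(lo, hi) for name, (lo, hi) in SHARED.items()}
    for z_name, branches in BRANCHING.items():
        branch = rng.choice(sorted(branches))        # pick a branching level
        config[z_name] = branch
        for v_name, levels in branches[branch].items():
            config[v_name] = rng.choice(levels)      # sample only the nested ones
    return config

cfg = sample_config(random.Random(0))
```

A model that treats all inputs as independent would have to assign some value to `depth` even when `network` is `mobilenet`, which is the restriction the abstract describes.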

Paper Structure (15 sections, 4 theorems, 24 equations, 6 figures, 6 tables)

Key Result

Theorem 1

Suppose that the nested variable $v_j^{b}$, which is nested within the branching variable $z_k=b$, has $g^b_j$ levels, for any $b\in\{1,2,\dots,l_k\}$. The product kernel function $R(\mathbf{x},\mathbf{x}')$ is symmetric and positive definite if the hyperparameters $\boldsymbol{\phi}_k$ satisfy $\min_b [ \exp(-\phi^b_{kj})+(1-\exp(-\phi^b_{kj}))/g^b_j ] \ge \exp(-\gamma_k)$ for all $j \in \{1,2,\ldots, m_k\}$ and $k\in\{1,2,\ldots, q\}$.
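The sufficient condition quoted in the TL;DR, $\min_b [ \exp(-\phi^b_{kj})+(1-\exp(-\phi^b_{kj}))/g^b_j ] \ge \exp(-\gamma_k)$, is straightforward to check numerically for a fixed pair $(j, k)$. The sketch below does so; the function name `satisfies_condition`, the dictionary layout keyed by branching level $b$, and the example values are assumptions made for illustration.

```python
import math

def satisfies_condition(phi, g, gamma):
    """Check the sufficient condition for one nested variable j under one
    branching variable k.

    phi[b]: nested-kernel hyperparameter phi^b_{kj} at branching level b
    g[b]:   number of levels g^b_j of the nested variable at level b
    gamma:  branching-kernel hyperparameter gamma_k
    """
    lhs = min(
        math.exp(-phi[b]) + (1.0 - math.exp(-phi[b])) / g[b]
        for b in phi
    )
    return lhs >= math.exp(-gamma)

# Hypothetical values for a branching variable with two levels.
phi = {1: 0.5, 2: 1.0}
g = {1: 2, 2: 3}
ok_large_gamma = satisfies_condition(phi, g, gamma=1.0)
ok_small_gamma = satisfies_condition(phi, g, gamma=0.1)
```

Because the right-hand side $\exp(-\gamma_k)$ approaches 1 as $\gamma_k$ shrinks, the condition becomes harder to satisfy for small $\gamma_k$. One possible use of such a check is as a constraint when estimating kernel hyperparameters.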

Figures (6)

  • Figure 1: An illustration of the synthetic function with one branching parameter $z$, one corresponding nested parameter $v$, and two quantitative parameters, $x_1$ and $x_2$. For the five different combinations of the branching and nested parameters, this picture shows the function projected onto $x_1$ at $x_2=0$.
  • Figure 2: Comparison of B&N with five existing methods on synthetic data. Starting from 10 randomly generated initial points, two adaptive procedures, sequential and batch, are run to add 50 further observations. The results based on the sequential procedure are shown in the upper panel and those for the batch procedure with batch size 5 are shown in the lower panel. The plots on the left show the best function value as the search progresses. B&N appears to identify the optimal setting and reach the global optimum much faster than the other methods. For each method, the box plots on the right summarize the final optimal results from 20 replicates at the end of the search. The average optimal results found by B&N consistently outperform those of the existing methods, with smaller variation.
  • Figure 3: Optimal tuning for CNNs on two datasets, CIFAR-10 (left) and CIFAR-100 (right). The proposed B&N is implemented with both the sequential procedure and the batch procedure with batch size 8. Its performance is compared with CoCaBO and BanditBO using the sequential procedure, which is the most efficient alternative found in numerical studies. Starting from 28 initial settings randomly generated by Latin hypercube designs, 56 additional settings are evaluated. The plots on the left show the best prediction accuracy as the search progresses. For each method, the box plots on the right summarize the optimal prediction accuracy from 10 replicates at the end of the search. The B&N procedures appear to outperform CoCaBO and BanditBO, and the sequential procedure of B&N converges to a stationary point faster than the batch procedure.
  • Figure 4: Sensitivity analysis. The left panel shows the marginal main effects of the five shared variables. Weight decay, denoted by 'wd', has the most significant decreasing effect on accuracy, and an extremely low learning rate can lead to low prediction accuracy. The middle panel illustrates the interaction effect between epoch and depth when ResNet is implemented. Depth shows a slightly decreasing effect for smaller numbers of epochs, but the effect becomes concave for larger numbers of epochs. The right panel shows the interaction between depth and momentum, where the effect of depth increases slightly for a smaller momentum but decreases for a larger momentum.
  • Figure 5: Two-factor interaction plots for the five shared variables, where learning rate is denoted by 'lr', epoch is denoted by 'epo', batch is denoted by 'bat', momentum is denoted by 'mom', and weight decay is denoted by 'wd'.
  • ...and 1 more figure

Theorems & Definitions (7)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • proof
  • Theorem 4
  • proof