Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

Shuhei Watanabe

TL;DR

The paper demystifies the Tree-structured Parzen Estimator by systematically ablating its components and showing how each control parameter influences exploitation versus exploration. It provides a comprehensive set of default, empirically validated settings, including multivariate KDEs, priors, and bandwidth strategies, and demonstrates robust performance improvements over traditional TPE implementations and several baselines. The work emphasizes noise-aware bandwidth selection and presents practical guidelines for applying TPE across diverse benchmark and real-world HPO tasks. Overall, it offers actionable insights to tune TPE for better empirical performance with clear recommendations and thorough experimental validation.

Abstract

Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-structured Parzen estimator (TPE) is a widely used Bayesian optimization method in recent parameter tuning frameworks such as Hyperopt and Optuna. Despite its popularity, the roles of each control parameter in TPE and the algorithm intuition have not been discussed so far. The goal of this paper is to identify the roles of each control parameter and their impacts on parameter tuning based on the ablation studies using diverse benchmark datasets. The recommended setting concluded from the ablation studies is demonstrated to improve the performance of TPE. Our TPE implementation used in this paper is available at https://github.com/nabenabe0928/tpe/tree/single-opt.

Paper Structure (37 sections, 25 equations, 45 figures, 14 tables, 1 algorithm)

Figures (45)

  • Figure 1: The conceptual visualization of TPE. Left: the objective function $y = f(\boldsymbol{x})$ (black dashed line) and its observations $\mathcal{D}$. The magnified figure shows the boundary $y = y^\gamma$ (green dotted line) of $\mathcal{D}^{(l)}$ (red squares) and $\mathcal{D}^{(g)}$ (blue squares). Top right: the KDEs built by $\mathcal{D}^{(l)}$ (red solid line) and $\mathcal{D}^{(g)}$ (blue solid line). Bottom right: the density ratio $p(\boldsymbol{x} |\mathcal{D}^{(l)}) / p(\boldsymbol{x} | \mathcal{D}^{(g)})$ (purple dotted line) used for the acquisition function. We pick the configuration with the best acquisition function value (green star) in the samples (black triangles) from $p(\boldsymbol{x} |\mathcal{D}^{(l)})$.
  • Figure 2: The optimizations of the Styblinski function using the splitting algorithm linear and sqrt. The red and blue dots show the observations till each "X evaluations". The lower left blue shade in each figure is the optimal point and this area should be found with as few observations as possible. Left column: the optimization using linear. The optimal area is found with around $500$ evaluations owing to strong exploitation. Right column: the optimization using sqrt. The optimal area is found with around $100$ evaluations thanks to exploration. Although there is no observation near the optimal area for both methods at $50$ evaluations, sqrt finds the optimal area thanks to its exploration nature.
  • Figure 3: The optimizations of the Styblinski function using the splitting algorithm linear with different $\beta_1$ (Left column: $\beta_1 = 0.05$, Center column: $\beta_1 = 0.15$, Right column: $\beta_1 = 0.25$). The blue dots show the observations till each "X evaluations". The lower left blue shade in each figure is the optimal point and this area should be found with as few observations as possible. We see scattered dots (more explorative) for a small $\beta_1$ and concentrated dots (more exploitative) for a large $\beta_1$.
  • Figure 4: The optimizations of the Styblinski function using the splitting algorithm sqrt with different $\beta_2$ (Left column: $\beta_2 = 0.25$, Center column: $\beta_2 = 0.5$, Right column: $\beta_2 = 0.75$). The red dots show the observations till each "X evaluations". The lower left blue shade in each figure is the optimal point and this area should be found with as few observations as possible. We see scattered dots (more explorative) for a small $\beta_2$ and concentrated dots (more exploitative) for a large $\beta_2$.
  • Figure 5: The distributions of each weighting algorithm when using $N^{(g)} = 100$. Left: the weight distribution for the uniform. Right: the weight distribution for the old decay. Older observations get lower weights and the latest 25 observations get the uniform weight.
  • ...and 40 more figures
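
The captions for Figures 2 to 5 refer to the splitting algorithms linear and sqrt and to the old decay weighting without stating their formulas. The sketch below is one plausible reading, not the paper's implementation. It assumes that linear puts $\lceil \beta_1 N \rceil$ observations in the better group $\mathcal{D}^{(l)}$, that sqrt puts $\lceil \beta_2 \sqrt{N} \rceil$ there, and that old decay ramps the weights of older observations linearly from $1/N$ up to $1$ while the latest 25 keep weight 1. These formulas and all function names are assumptions made for illustration.

```python
import math


def n_lower_linear(n_obs: int, beta1: float) -> int:
    """Size of the better group under the ASSUMED linear rule ceil(beta1 * N)."""
    return min(n_obs, math.ceil(beta1 * n_obs))


def n_lower_sqrt(n_obs: int, beta2: float) -> int:
    """Size of the better group under the ASSUMED sqrt rule ceil(beta2 * sqrt(N))."""
    return min(n_obs, math.ceil(beta2 * math.sqrt(n_obs)))


def old_decay_weights(n_obs: int, n_recent: int = 25) -> list[float]:
    """Weights ordered from oldest to newest observation.

    ASSUMPTION: the latest `n_recent` observations get weight 1 and the older
    ones ramp linearly from 1/N to 1, matching the caption's description that
    older observations get lower weights.
    """
    if n_obs <= n_recent:
        return [1.0] * n_obs
    n_old = n_obs - n_recent
    start = 1.0 / n_obs
    if n_old == 1:
        ramp = [start]
    else:
        step = (1.0 - start) / (n_old - 1)
        ramp = [start + i * step for i in range(n_old)]
    return ramp + [1.0] * n_recent


if __name__ == "__main__":
    # Compare group sizes for the beta values that appear in Figures 3 and 4.
    for n in (50, 100, 500):
        print(n, n_lower_linear(n, 0.25), n_lower_sqrt(n, 0.25))
    weights = old_decay_weights(100)  # the setting N^(g) = 100 from Figure 5
    print(weights[0], weights[-1])
```

Under these assumed rules the sqrt group grows sublinearly, so for large $N$ it contains far fewer observations than the linear group at the same coefficient, which is consistent with the captions associating sqrt and small $\beta$ values with more exploration.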