Information-Theoretic Bayesian Optimization for Bilevel Optimization Problems

Takuya Kanayama, Yuki Ito, Tomoyuki Tamura, Masayuki Karasuyama

TL;DR

This paper considers Bayesian optimization for bilevel problems in which both the upper and lower levels involve expensive black-box functions, and proposes an information-theoretic approach that considers the information gain of both the upper- and lower-level optimal solutions and values.

Abstract

A bilevel optimization problem consists of two nested optimization problems, an upper-level and a lower-level problem, in which the optimality of the lower-level problem defines a constraint for the upper-level problem. This paper considers Bayesian optimization (BO) for the case in which both the upper and lower levels involve expensive black-box functions. Because of its nested structure, bilevel optimization has a complex problem definition, and as a result bilevel BO has been studied far less than other standard extensions of BO, such as multi-objective or constrained problems. We propose an information-theoretic approach that considers the information gain of both the upper- and lower-level optimal solutions and values. This enables us to define a unified criterion that measures the benefit for both levels simultaneously. We also present a practical lower-bound-based approach to evaluating the information gain. We empirically demonstrate the effectiveness of the proposed method on several benchmark datasets.
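
In symbols, the nested problem described above can be sketched as follows, using the notation of the Figure 1 caption ($f$ for the upper-level objective, $g$ for the lower-level objective, $\boldsymbol{x}$ and $\boldsymbol{\theta}$ for the upper- and lower-level variables). The domain symbols ${\cal X}$ and $\Theta$ are an assumption inferred from the dimension notation $d_{\cal X}$ and $d_\Theta$, and the maximization convention follows that caption.

$$
\max_{\boldsymbol{x} \in {\cal X}} \; f(\boldsymbol{x}, \boldsymbol{\theta}^*(\boldsymbol{x}))
\quad \text{s.t.} \quad
\boldsymbol{\theta}^*(\boldsymbol{x}) = \arg\max_{\boldsymbol{\theta} \in \Theta} g(\boldsymbol{x}, \boldsymbol{\theta})
$$

The constraint is itself an optimization problem, so evaluating a single candidate $\boldsymbol{x}$ exactly would require solving the lower-level problem first.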

Paper Structure

This paper contains 42 sections, 1 theorem, 49 equations, 15 figures.

Key Result

Theorem 3.1

Let $(m_1^f, s_1^f)$, $(m_2^f, s_2^f)$, and $(m_3^f, s_3^f)$ be the mean and standard deviation of $p(f(\boldsymbol{x},\boldsymbol{\theta}^*(\boldsymbol{x})) \mid y^f_{(\boldsymbol{x}, \boldsymbol{\theta})}, {\cal D}_t^+)$, $p(f(\boldsymbol{x},\boldsymbol{\theta}^*(\boldsymbol{x})) \mid {\cal D}_t^+ \ldots$ where $\phi$ and $\Phi$ are the probability density function (PDF) and cumulative distribution function (CDF) ...

Figures (15)

  • Figure 1: Example of bilevel optimization ($d_{\cal X} = 1, d_\Theta = 1$) and its optimal solution. For each upper-level variable $x$, the feasible solution is defined by $\theta^*(x) = \arg\max_{\theta} g(x,\theta)$, shown as the orange line. The optimal solution $(x^*, \theta^*)$ is the maximizer of $f$ on the orange line.
  • Figure 2: Schematic illustration of the random variables in $\mathrm{MI}( y^f_{(\boldsymbol{x}, \boldsymbol{\theta})}, y^g_{(\boldsymbol{x}, \boldsymbol{\theta})} \ ; \ o^* \mid {\cal D}_t )$. The left panel shows the input space of $f$ and $g$ (the heatmaps are the true functions), and the right panel shows the output space. The predictions at the current $(\boldsymbol{x}, \boldsymbol{\theta})$ are $y^f_{(\boldsymbol{x}, \boldsymbol{\theta})}$ and $y^g_{(\boldsymbol{x}, \boldsymbol{\theta})}$, depicted as the green distribution in the output space. At the optimal solution $(\boldsymbol{x}^*, \boldsymbol{\theta}^*)$, which is also a random variable defined through the GP posteriors, the objective function values are $f^* = f(\boldsymbol{x}^*,\boldsymbol{\theta}^*)$ and $g^* = g(\boldsymbol{x}^*,\boldsymbol{\theta}^*)$. The optimal values $f^*$ and $g^*$ are represented as the blue distribution in the output space. The optimal solution $(\boldsymbol{x}^*, \boldsymbol{\theta}^*)$, represented as the red distribution in the input space, must lie on $\boldsymbol{\theta}^*(\boldsymbol{x})$, i.e., the optimal point of the lower-level problem. The red and blue stars are 'samples' of the optimal solutions $(\boldsymbol{x}^*, \boldsymbol{\theta}^*)$ and values $(f^*, g^*)$, respectively.
  • Figure 3: Regret comparison on functions from the GP prior.
  • Figure 4: Regret comparison on benchmark and real-world problems.
  • Figure 5: Regret comparison in the decoupled setting.
  • ...and 10 more figures
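
The construction in the Figure 1 caption (find $\theta^*(x)$ for each $x$, then maximize $f$ along that curve) can be illustrated with a brute-force grid search. This is only a minimal sketch: the toy objectives `f` and `g`, the helper `solve_bilevel_on_grid`, and the grid resolution are hypothetical choices made for illustration, and exhaustive evaluation is exactly what the expensive black-box setting of the paper rules out.

```python
import numpy as np

# Hypothetical toy objectives (not from the paper), chosen so the answer is easy to check.
def g(x, theta):
    # Lower-level objective: for a fixed x, it is maximized at theta = x.
    return -(theta - x) ** 2

def f(x, theta):
    # Upper-level objective: its unconstrained maximum is at (x, theta) = (0.6, 0.2).
    return -(x - 0.6) ** 2 - (theta - 0.2) ** 2

def solve_bilevel_on_grid(f, g, xs, thetas):
    """Brute-force bilevel maximization over a 2-D grid (illustration only)."""
    X, T = np.meshgrid(xs, thetas, indexing="ij")  # both have shape (len(xs), len(thetas))
    F, G = f(X, T), g(X, T)
    # Lower level: index of theta*(x) = argmax_theta g(x, theta) for every x.
    lower_idx = np.argmax(G, axis=1)
    # Upper level: evaluate f only along the curve (x, theta*(x)) and maximize over x.
    f_on_curve = F[np.arange(len(xs)), lower_idx]
    i = int(np.argmax(f_on_curve))
    return xs[i], thetas[lower_idx[i]], f_on_curve[i]

xs = np.linspace(0.0, 1.0, 101)
thetas = np.linspace(0.0, 1.0, 101)
x_star, theta_star, f_star = solve_bilevel_on_grid(f, g, xs, thetas)
print(x_star, theta_star, f_star)
```

In this toy problem the lower level forces $\theta^*(x) = x$, so the bilevel optimum lies at $x = \theta = 0.4$ and not at the unconstrained maximizer $(0.6, 0.2)$ of $f$, which shows how the lower-level optimality acts as a constraint on the upper level.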

Theorems & Definitions (1)

  • Theorem 3.1