Transition Constrained Bayesian Optimization via Markov Decision Processes

Jose Pablo Folch, Calvin Tsay, Robert M Lee, Behrang Shafei, Weronika Ormaniec, Andreas Krause, Mark van der Wilk, Ruth Misener, Mojmír Mutný

TL;DR

This work iteratively solves a tractable linearization of the utility function using reinforcement learning to obtain a policy that plans ahead for the entire horizon, in parallel to the optimization of an acquisition function, but carried out in policy space.

Abstract

Bayesian optimization is a methodology to optimize black-box functions. Traditionally, it focuses on the setting where you can arbitrarily query the search space. However, many real-life problems do not offer this flexibility; in particular, the search space of the next query may depend on previous ones. Example challenges arise in the physical sciences in the form of local movement constraints, required monotonicity in certain variables, and transitions influencing the accuracy of measurements. Altogether, such transition constraints necessitate a form of planning. This work extends classical Bayesian optimization via the framework of Markov Decision Processes. We iteratively solve a tractable linearization of our utility function using reinforcement learning to obtain a policy that plans ahead for the entire horizon. This is a parallel to the optimization of an acquisition function in policy space. The resulting policy is potentially history-dependent and non-Markovian. We showcase applications in chemical reactor optimization, informative path planning, machine calibration, and other synthetic examples.
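The planning loop described in the abstract can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the MDP is a hypothetical one-dimensional chain where each step moves at most one cell (a local movement constraint), the utility is a D-optimal log-determinant design criterion standing in for the paper's utility, and exact backward dynamic programming stands in for the reinforcement learning solver. The helper names (`visitation_of_best_policy`, `frank_wolfe`) are likewise hypothetical. Each Frank-Wolfe iteration linearizes the utility at the current state-visitation distribution, solves the resulting planning problem with per-state costs given by the gradient, and mixes the new policy's visitation into the iterate.

```python
import numpy as np


def visitation_of_best_policy(cost, start, horizon, n_states):
    """Backward DP on a deterministic chain MDP (moves: left, stay, right).

    Returns the average state-visitation vector of the policy that minimizes
    the summed per-state cost of the states it arrives at.
    """
    moves = (-1, 0, 1)
    value = np.zeros((horizon + 1, n_states))       # cost-to-go table
    best_next = np.zeros((horizon, n_states), dtype=int)
    for h in range(horizon - 1, -1, -1):
        for s in range(n_states):
            nxt = [min(max(s + m, 0), n_states - 1) for m in moves]
            costs = [cost[n] + value[h + 1, n] for n in nxt]
            k = int(np.argmin(costs))               # ties: first move wins
            best_next[h, s] = nxt[k]
            value[h, s] = costs[k]
    d = np.zeros(n_states)
    s = start
    for h in range(horizon):                        # roll the policy forward
        s = best_next[h, s]
        d[s] += 1.0 / horizon
    return d


def information_matrix(d, Phi, lam):
    return Phi.T @ (d[:, None] * Phi) + lam * np.eye(Phi.shape[1])


def objective(d, Phi, lam):
    """Stand-in convex utility: negative log-determinant (D-optimal design)."""
    return -np.linalg.slogdet(information_matrix(d, Phi, lam))[1]


def gradient(d, Phi, lam):
    """dF/dd_s = -phi_s^T M^{-1} phi_s for each state s."""
    M_inv = np.linalg.inv(information_matrix(d, Phi, lam))
    return -np.einsum("sk,kj,sj->s", Phi, M_inv, Phi)


def frank_wolfe(Phi, start, horizon, lam=0.1, iters=50):
    n = Phi.shape[0]
    # Feasible starting point: visitation of an arbitrary (zero-cost) policy.
    d = visitation_of_best_policy(np.zeros(n), start, horizon, n)
    history = [objective(d, Phi, lam)]
    for t in range(iters):
        g = gradient(d, Phi, lam)                   # linearize the utility
        d_new = visitation_of_best_policy(g, start, horizon, n)
        step = 2.0 / (t + 2.0)                      # standard FW step size
        d = (1.0 - step) * d + step * d_new         # mixture of policies
        history.append(objective(d, Phi, lam))
    return d, history


# Toy problem: 15 cells, 10 steps per episode, RBF features (all hypothetical).
n_states, horizon, start = 15, 10, 7
centers = np.linspace(0, n_states - 1, 5)
grid = np.arange(n_states)[:, None]
Phi = np.exp(-((grid - centers[None, :]) ** 2) / (2 * 2.0**2))
d, history = frank_wolfe(Phi, start, horizon)
print("objective: start", round(history[0], 3), "end", round(history[-1], 3))
```

The final iterate is a convex combination of the visitation vectors of several deterministic policies, which is one way to read the abstract's remark that the deployed policy may be stochastic. The sketch does not include the adaptive resampling step that makes the policy history-dependent.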

Paper Structure

This paper contains 46 sections, 4 theorems, 90 equations, 18 figures, and 2 tables.

Key Result

Proposition C.1

Assume episodic feedback, and suppose that, for any $\mathcal{Z}$, we can show that the Algorithm (alg: movement_bo_via_mdps) satisfies, for the sequence of iterates $\{\hat{d}_t\}_{t=1}^T$: with probability $1-\delta$ over the sampling from the Markov chain. The randomness of the confidence set is captured by the assumption in Eq. (eq: decrease).

Figures (18)

  • Figure 1: Representative task of finding pollution in a river while following the current. (a) Problem formulation: The star represents the maximizer and the arrows the Markov dynamics. (b) Objective formulation: Orange balls represent potential maximizers, with size corresponding to model uncertainty. (c) Optimization: Deploy a potentially stochastic policy that minimizes our objective.
  • Figure 2: The Knorr pyrazole synthesis experiment. On the left, we show the quantitative results. The line plots denote the best prediction regret, while the bar charts denote the percentage of runs that correctly identify the best arm at the end of each episode. On the right, we show ten paths chosen by the algorithm, each in a different colour. The underlying black-box function is shown as contours, and the discretization as dots. Four potential maximizers remain (in orange), including the true one (star). Notice that all paths are non-decreasing in residence time, as required by the transition constraints.
  • Figure 3: Results for the Ypacarai and free-electron laser tuning experiments. On the left, the line plots denote the best prediction regret, while the bar charts denote the percentage of runs that correctly identify the best arm at the end of each episode. On the right, we plot the regret and compare against standard BO that does not account for movement-dependent noise.
  • Figure 4: Results of experiments on the asynchronous and synchronous benchmarks. We plot the median predictive regret and the 10% and 90% quantiles. For the asynchronous experiments, the paths taken by MDP-BO-TS are more consistent, and the final performance is comparable to TrSnAKe. In the asynchronous setting, creating the maximization set using Thompson Sampling gave stronger performance, whereas in the synchronous setting UCB is preferred. LSR performs very strongly, comparable to MDP-BO-UCB in almost all benchmarks.
  • Figure 5: Visual abstract of the work. In black we show the method presented in this paper, with literature connections shown in blue. In red we show solutions which we did not pursue due to intractability. The problem creates the (a) need to plan ahead. To do this, we take inspiration from hypothesis testing and focus on (b) the variance reduction in a set of maximizers, which leads to our (c) acquisition function. The objective is the same as that of fiez2019sequential, introduced in the linear bandits literature from a frequentist perspective. To optimize it, we follow developments in Hazan2019 and mutny2023active by (d) relaxing the acquisition function to the space of state-action distributions and (e) solving the planning problem using the Frank-Wolfe algorithm. This consists of iteratively solving tractable (f) reinforcement learning sub-problems, which give us optimal Markov policies. We then apply adaptive resampling to obtain (g) non-Markovian policies.
  • ...and 13 more figures

Theorems & Definitions (9)

  • Proposition C.1
  • proof: Proof of Proposition \ref{prop:theory}
  • Lemma D.1: Additivity of Best-arm Objective
  • proof
  • Lemma D.2: Sherman-Morrison-Woodbury (SMW)
  • Lemma D.3: Matrix Inversion Lemma
  • proof
  • Remark D.4
  • proof