This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian Optimization

Anthony Bardou; Patrick Thiran; Giovanni Ranieri

TL;DR

This paper designs a Wasserstein distance-based criterion able to quantify the relevancy of an observation with respect to future predictions and uses this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus maintaining simultaneously a good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon.

Abstract

Bayesian Optimization (BO) has proven to be very successful at optimizing a static, noisy, costly-to-evaluate black-box function $f : \mathcal{S} \to \mathbb{R}$. However, optimizing a black-box which is also a function of time (i.e., a dynamic function) $f : \mathcal{S} \times \mathcal{T} \to \mathbb{R}$ remains a challenge, since a dynamic Bayesian Optimization (DBO) algorithm has to keep track of the optimum over time. This changes the nature of the optimization problem in at least three aspects: (i) querying an arbitrary point in $\mathcal{S} \times \mathcal{T}$ is impossible, (ii) past observations become less and less relevant for keeping track of the optimum as time goes by and (iii) the DBO algorithm must have a high sampling frequency so it can collect enough relevant observations to keep track of the optimum through time. In this paper, we design a Wasserstein distance-based criterion able to quantify the relevancy of an observation with respect to future predictions. Then, we leverage this criterion to build W-DBO, a DBO algorithm able to remove irrelevant observations from its dataset on the fly, thus maintaining simultaneously a good predictive performance and a high sampling frequency, even in continuous-time optimization tasks with unknown horizon. Numerical experiments establish the superiority of W-DBO, which outperforms state-of-the-art methods by a comfortable margin.
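The idea of scoring an observation by how much its removal shifts future predictions can be illustrated with a small NumPy sketch. This is not the paper's criterion or the W-DBO algorithm: it assumes a zero-mean GP with a separable squared-exponential space-time kernel, and it averages the closed-form pointwise 2-Wasserstein distance between univariate Gaussians over a grid of future points instead of integrating over the future domain. The names `staleness_scores`, `horizon`, `ls_space` and `ls_time`, as well as the toy data, are hypothetical.

```python
import numpy as np

def rbf(a, b, lengthscale):
    """Squared-exponential kernel between two 1-D arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def kernel(X1, X2, ls_space=1.0, ls_time=1.0):
    """Separable space-time kernel (an assumption) on rows (x, t)."""
    return rbf(X1[:, 0], X2[:, 0], ls_space) * rbf(X1[:, 1], X2[:, 1], ls_time)

def posterior(X, y, Xq, noise=1e-2, **kw):
    """Zero-mean GP posterior mean and standard deviation at query points Xq."""
    K = kernel(X, X, **kw) + noise * np.eye(len(X))
    Kq = kernel(Xq, X, **kw)
    mean = Kq @ np.linalg.solve(K, y)
    # Prior variance is 1 because the kernel equals 1 at zero distance.
    var = 1.0 - np.sum(Kq * np.linalg.solve(K, Kq.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def w2_pointwise(m1, s1, m2, s2):
    """Closed-form 2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2)."""
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def staleness_scores(X, y, t0, horizon=1.0, n_query=50, **kw):
    """For each observation, average W2 shift of future predictions if it is dropped."""
    xs = np.linspace(X[:, 0].min(), X[:, 0].max(), n_query)
    ts = np.linspace(t0, t0 + horizon, 5)
    Xq = np.array([(x, t) for t in ts for x in xs])  # grid of future points
    m_full, s_full = posterior(X, y, Xq, **kw)
    scores = []
    for i in range(len(X)):
        keep = np.arange(len(X)) != i
        m, s = posterior(X[keep], y[keep], Xq, **kw)
        scores.append(w2_pointwise(m_full, s_full, m, s).mean())
    return np.array(scores)

# Toy data: ten observations taken at times 0, 2, ..., 18; present time is 19.
idx = np.arange(10)
X = np.column_stack([0.5 + 0.4 * np.sin(idx), 2.0 * idx])
y = np.sin(3.0 * X[:, 0] + 0.5 * X[:, 1])
scores = staleness_scores(X, y, t0=19.0, ls_space=1.0, ls_time=1.0)

# A small score means the observation barely affects future predictions.
stale = np.flatnonzero(scores < 1e-4)
```

In this toy setting, the oldest observations receive scores close to zero and would be candidates for removal, while the most recent one has a clearly larger score. The leave-one-out loop refits the posterior for every observation, which is only meant to make the idea concrete; it is not a statement about how the paper achieves computational tractability.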

Paper Structure (49 sections, 15 theorems, 98 equations, 22 figures, 5 tables, 1 algorithm)

Introduction
Background
A Wasserstein Distance-Based Criterion
Core Assumptions
Measuring the Relevancy of an Observation
Using the Criterion in Practice
Computational Tractability
W-DBO
Numerical Results
Sensitivity Analysis
Comparison with Baselines
Conclusion
Wasserstein Distance at a Point in $\mathcal{F}_{t_0}$
Wasserstein Distance on $\mathcal{F}_{t_0}$
Approximation Error
...and 34 more sections

Key Result

Theorem 4.1

Let $t_0$ be the present time and $\mathcal{D} = \left\{((\bm x_i, t_i), y_i)\right\}_{i \in \llbracket1, n\rrbracket}$ be a dataset of observations made before $t_0$. Let $\Tilde{\mathcal{D}} = \left\{((\bm x_i, t_i), y_i)\right\}_{i \in \llbracket2, n\rrbracket}$ be the dataset without the first observation [...] where $C(\mathcal{X}, \mathcal{Y}) = \left((k_S * k_S)(\bm x_j - \bm x_i) \cdot (k_T * k_T)_{t_0 - [...]

Figures (22)

Figure 1: Similar values of Wasserstein distance, different effect on posteriors. For visualization purposes, only the posterior means of two posterior GPs (blue for $\mu_\mathcal{D}$ and orange for $\mu_{\Tilde{\mathcal{D}}}$) are depicted, along a single dimension (e.g., time). The Wasserstein distance between the two posteriors is shown by the green shaded area. The GPs have a small lengthscale (left) or, conversely, a large lengthscale (right) for the chosen dimension.
Figure 2: Normalized Wasserstein distances. Similarly to Figure 1, a few pairs of GP posterior means $(\mu_\mathcal{D}, \mu_{\Tilde{\mathcal{D}}})$ are depicted. The top (resp., bottom) row depicts pairs of posteriors that yield a small (resp., large) ratio \ref{eq:wasserstein-relative}. The left (resp., right) column depicts pairs of posteriors controlled by a small (resp., large) lengthscale. The prior GP mean $\mu_\emptyset = 0$ is shown as a black dashed line, and the Wasserstein distance between the posterior and the prior as a gray shaded area.
Figure 3: (Left) Sensitivity analysis on the Eggholder function. (Right) Aggregation of sensitivity analyses of W-DBO made on 10 synthetic functions and a real-world experiment. For aggregation purposes, the average regrets in each experiment have been normalized between 0 (lowest average regret) and 1 (largest average regret). The average performance of W-DBO over all the experiments is shown in black. Standard errors are depicted with colored bars (left) and shaded areas (right).
Figure 4: (Left) Average regrets of the DBO solutions during the optimization of the Ackley synthetic function. (Right) Dataset sizes of the DBO solutions during the optimization of the Ackley function.
Figure 5: Visual summary of the results reported in Table \ref{tab:results}. For aggregation purposes, the average regrets in each experiment have been normalized between 0 (lowest average regret) and 1 (largest average regret). The average performance of the DBO solutions is shown in black.
...and 17 more figures

Theorems & Definitions (28)

Theorem 4.1
Lemma A.1
proof
Proposition A.2
Proposition A.3
proof
proof
Lemma B.1
proof
Lemma B.2
...and 18 more
