Reinforced In-Context Black-Box Optimization

Lei Song, Chenxiao Gao, Ke Xue, Chenyang Wu, Dong Li, Jianye Hao, Zongzhang Zhang, Chao Qian

TL;DR

RIBBO reinforce-learns a BBO algorithm from offline data in an end-to-end fashion by augmenting the optimization histories with regret-to-go tokens, which represent the performance of an algorithm as the cumulative regret over the future part of the histories.

Abstract

Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is the augmentation of the optimization histories with regret-to-go tokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that satisfy the user-desired regret, which is verified by its universally good empirical performance on diverse problems, including BBO benchmark functions, hyper-parameter optimization and robot control problems.

Paper Structure (25 sections, 5 equations, 8 figures, 2 tables, 1 algorithm)

Figures (8)

  • Figure 1: Illustration of RIBBO. Left: Data Generation. $K$ existing BBO algorithms $\{\mathcal{A}_j\}_{j=1}^K$ and $N$ BBO tasks $\{f_i\}_{i=1}^N$ serve as the behavior algorithms and the training tasks, respectively. The offline datasets $\{\mathcal{D}_{i,j}\}$ consist of the optimization histories $\bm h_T=\{(\bm x_t, y_t)\}_{t=1}^T$ collected by executing each behavior algorithm $\mathcal{A}_j$ on each task $f_i$ for $T$ evaluation steps. These histories are then augmented with the regret-to-go tokens $R_t$ (calculated as the cumulative regret $\sum_{t'=t+1}^{T} (y^*-y_{t'})$ over the future optimization history) to generate the final dataset $\{\widehat{\mathcal{D}}_{i,j}\}$ for training. Right: Training and Inference. Our model takes in triplets of $(\bm{x}_t, y_t, R_t)$, embeds them into one token, and outputs the distribution over the next query point $\bm x_{t+1}$. During training, the ground-truth next query point is used to minimize the loss in Eq. (\ref{eq:ribbo_objective}). During inference, the next query point $\bm x_{t+1}$ is generated auto-regressively based on the current history $\hat{\bm h}_t$.
  • Figure 2: Performance comparison among RIBBO, BC, BC Filter, OptFormer, and behavior algorithms on synthetic functions, HPO, and robot control problems. The $y$-axis is the normalized average objective value, and the length of vertical bars represents the standard deviation.
  • Figure 3: (a) Visualization of the contour lines of the $2$D Branin function and the sampling points of RIBBO (red), Eagle Strategy (orange), and Random Search (gray), where the arrows represent the optimization trajectory of RIBBO. (b) Generalization by training the model across all $24$ BBOB synthetic functions simultaneously. The results across functions with different output scaling are normalized to obtain the aggregate results. The legend is shared with Figure \ref{fig:main_exp}. (c) Influence of the initial RTG $R_0$ on performance. (d) RTG update strategy comparison between HRR and the naive strategy with various initial RTG $R_0$.
  • Figure 4: Comparison of the behavior algorithms with the OptFormer re-implementation.
  • Figure 5: Cross-distribution generalization by training on $4$ of $5$ chosen synthetic functions and testing on the remaining one.
  • ...and 3 more figures
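
The regret-to-go definition in the Figure 1 caption, $R_t=\sum_{t'=t+1}^{T}(y^*-y_{t'})$, can be illustrated with a small sketch. This is not the authors' code: the function names `regret_to_go` and `augment_history`, the toy values, and the assumption that the optimum $y^*$ is known and the task is a maximization problem are all made up for illustration.

```python
from typing import List, Sequence, Tuple


def regret_to_go(ys: Sequence[float], y_star: float) -> List[float]:
    """Return [R_0, R_1, ..., R_T] for observed values y_1..y_T.

    R_t = sum_{t'=t+1}^{T} (y_star - y_{t'}), so R_T = 0 by definition.
    """
    rtg = [0.0] * (len(ys) + 1)
    # Accumulate backwards from the end of the history.
    for t in range(len(ys) - 1, -1, -1):
        # ys[t] stores y_{t+1} (0-based list, 1-based step index).
        rtg[t] = rtg[t + 1] + (y_star - ys[t])
    return rtg


def augment_history(
    xs: Sequence[Sequence[float]], ys: Sequence[float], y_star: float
) -> List[Tuple[Sequence[float], float, float]]:
    """Build the triplets (x_t, y_t, R_t) for t = 1..T."""
    rtg = regret_to_go(ys, y_star)
    return [(x, y, r) for x, y, r in zip(xs, ys, rtg[1:])]


# Hypothetical toy history (maximization, assumed known optimum y* = 5).
xs = [[0.1], [0.4], [0.5]]
ys = [1.0, 3.0, 4.0]
y_star = 5.0

rtg = regret_to_go(ys, y_star)
triplets = augment_history(xs, ys, y_star)
print(rtg)       # [7.0, 3.0, 1.0, 0.0]
print(triplets)  # [([0.1], 1.0, 3.0), ([0.4], 3.0, 1.0), ([0.5], 4.0, 0.0)]
```

Under this definition the tokens satisfy $R_{t+1}=R_t-(y^*-y_{t+1})$, so each new evaluation reduces the remaining regret budget by the regret just incurred, and the token at the final step is zero.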