Language Model Embeddings Can Be Sufficient for Bayesian Optimization

Tung Nguyen, Qiuyi Zhang, Bangding Yang, Chansoo Lee, Jorg Bornschein, Yingjie Miao, Sagi Perel, Yutian Chen, Xingyou Song

TL;DR

This paper introduces embed-then-regress, a framework that uses language-model embeddings of string representations to perform in-context regression within Bayesian Optimization. By freezing a pretrained string embedder and training a Transformer Neural Process as the regressor, the approach yields calibrated predictions and competitive optimization performance against GP-based baselines across synthetic, combinatorial, and hyperparameter tuning tasks. Pretraining on diverse offline evaluations enables generalization to unseen objective functions, suggesting broad applicability beyond traditional tabular inputs. The work highlights the potential of flexible, string-based representations to expand the scope and efficiency of black-box optimization in real-world settings.

Abstract

Bayesian Optimization is ubiquitous in experimental design and black-box optimization for improving search efficiency. However, most existing approaches rely on regression models which are limited to fixed search spaces and structured, tabular input features. This paper explores the use of LLM embeddings over string inputs for in-context regression in Bayesian Optimization. Our results show that representing inputs as strings enables general-purpose regression across diverse domains, including synthetic, combinatorial, and hyperparameter optimization. Furthermore, our approach achieves optimization performance comparable to state-of-the-art Gaussian Process-based methods such as Google Vizier, and demonstrates potential for broader and more flexible applications.
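
The embed-then-regress loop described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It rests on three assumptions: a hashed character-trigram embedding stands in for the frozen language-model embedder, closed-form Bayesian linear regression stands in for the pretrained Transformer Neural Process, and an upper-confidence-bound (UCB) rule is used as the acquisition function. All function names (`to_string`, `embed`, `posterior`, `optimize`) and the toy objective are hypothetical.

```python
import hashlib
import numpy as np


def to_string(params: dict) -> str:
    """Serialize a candidate to a string (hypothetical format)."""
    return ",".join(f"{k}:{v}" for k, v in sorted(params.items()))


def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a frozen LM embedder: hashed character trigrams, L2-normalized."""
    vec = np.zeros(dim)
    padded = f"  {text} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        bucket = int(hashlib.md5(gram.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec / np.linalg.norm(vec)


def posterior(X_ctx, y_ctx, X_query, noise=0.1, prior_var=1.0):
    """Bayesian linear regression on embeddings (stand-in for the in-context regressor).

    Returns the predictive mean and standard deviation at the query points.
    """
    d = X_ctx.shape[1]
    precision = X_ctx.T @ X_ctx / noise**2 + np.eye(d) / prior_var
    cov = np.linalg.inv(precision)
    w_mean = cov @ X_ctx.T @ y_ctx / noise**2
    mean = X_query @ w_mean
    var = np.einsum("ij,jk,ik->i", X_query, cov, X_query) + noise**2
    return mean, np.sqrt(var)


def optimize(objective, candidates, n_init=3, n_trials=10, beta=1.0, seed=0):
    """Maximize `objective` over a finite candidate list with a UCB rule."""
    rng = np.random.default_rng(seed)
    feats = np.stack([embed(to_string(c)) for c in candidates])
    # Start with a few random trials, then let the regressor guide the search.
    tried = [int(i) for i in rng.permutation(len(candidates))[:n_init]]
    ys = [objective(candidates[i]) for i in tried]
    for _ in range(n_trials - n_init):
        remaining = [i for i in range(len(candidates)) if i not in tried]
        if not remaining:
            break
        y_arr = np.array(ys, dtype=float)
        y_norm = (y_arr - y_arr.mean()) / (y_arr.std() or 1.0)
        mean, std = posterior(feats[tried], y_norm, feats[remaining])
        nxt = remaining[int(np.argmax(mean + beta * std))]
        tried.append(nxt)
        ys.append(objective(candidates[nxt]))
    best = int(np.argmax(ys))
    return candidates[tried[best]], ys[best], tried


# Toy hyperparameter search space and objective (both hypothetical).
candidates = [{"lr": lr, "layers": n} for lr in (0.001, 0.01, 0.1) for n in (1, 2, 3, 4)]
objective = lambda p: -abs(np.log10(p["lr"]) + 2) - abs(p["layers"] - 3)
best, best_y, tried = optimize(objective, candidates)
```

Because the regressor only ever sees fixed-dimensional embeddings, the same loop would apply unchanged to permutations, subsets, or mixed parameter types, provided each candidate can be serialized to a string. Swapping in a real embedder and a pretrained in-context regressor would change only `embed` and `posterior`.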

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Using language models, we embed string representations of search space candidates as features for downstream regression.
  • Figure 2: Overview of our model. Most notably, candidates $x$ are converted to language model embeddings, which ultimately serve as fixed-dimensional features.
  • Figure 3: ($\downarrow$) Lower is better. Mean optimality gap curves across 9 randomized test functions, some with non-continuous parameters. Note: $y$-axis is log-scaled to depict clearer separation between baselines.
  • Figure 4: ($\uparrow$) Higher is better. Best-so-far curves across 8 randomized combinatorial problems. In each panel title, $(P)$ denotes a permutation space of size $P$ and $(N, K)$ denotes an $\binom{N}{K}$ choice space. Plotting begins at trial 20 because the earlier trials are random.
  • Figure 5: ($\uparrow$) Higher is better. Best-so-far curves over 8 randomly chosen hyperparameter surrogate functions. Each panel title gives a task summary along with the number of parameters $(\#P)$. Normalized objective values are displayed because raw objective values spanning large ranges, e.g. $y \in [-10^{7}, 10^{7}]$ for private functions, would lack meaningful context.
  • ...and 3 more figures