Bayesian Optimization in Language Space: An Eval-Efficient AI Self-Improvement Framework

Enoch Hyunwook Kang, Hema Yoganarasimhan

TL;DR

This work addresses the bottleneck of evaluation efficiency in self-improving AI by reframing prompt optimization as language-space Bayesian Optimization. It introduces TextGrad-Best-of-N Bayesian Optimization (T-BoN BO), which combines textual gradients (TextGrad) with Best-of-N gradient selection to mimic the UCB acquisition in language space without explicit surrogates. The authors prove that the Best-of-N gradient asymptotically aligns with the UCB gradient, yielding evaluation-efficient search, and validate the approach empirically on ad-optimization tasks using LLM-based persona simulations, where T-BoN BO outperforms state-of-the-art baselines like Best-of-N and GEPA. The results suggest that evaluation-efficient self-improvement can be achieved in practice, enabling faster and more robust alignment of AI-generated content with target user preferences across diverse scenarios, even with limited contextual information.

Abstract

Large Language Models (LLMs) have recently enabled self-improving AI, i.e., AI that iteratively generates, evaluates, and refines its own outcomes. Recent studies have shown that self-improving AI focusing on prompt optimization can outperform state-of-the-art reinforcement-learning fine-tuned LLMs. Here, their "performance" is typically measured by query efficiency, i.e., the number of LLM-generated solution samples required to meet a certain performance threshold. However, in many societal applications, the primary limitation is not generating new solutions but evaluating them. For instance, evaluating an ad's effectiveness requires significant human feedback, which is far more costly and time-consuming than generating a candidate ad. To optimize for the evaluation efficiency objective, a natural approach is to extend Bayesian Optimization (BO), a framework proven optimal for evaluation efficiency, to the language domain. However, the difficulty of directly estimating suitable acquisition functions in LLMs' minds makes this extension challenging. This paper overcomes this challenge by proving that the combination of the simple and widely used Best-of-N selection strategy and simple textual gradients (i.e., textual edits from a critic model) statistically emulates the behavior of the gradients on the canonical UCB acquisition function, which induces optimal exploration in terms of evaluation efficiency. Based on this result, we propose TextGrad-Best-of-N Bayesian Optimization (T-BoN BO), a simple and eval-efficient language-space Bayesian optimization framework for AI self-improvement. We also empirically validate T-BoN BO by applying it to automated ad alignment tasks for persona distribution, demonstrating its superior performance compared to popular state-of-the-art baselines.
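The intuition behind the claim that Best-of-N selection behaves like a UCB acquisition can be illustrated numerically. The sketch below is not the paper's construction: it assumes, purely for illustration, that a candidate's noisy score is Gaussian with mean `mu` and standard deviation `sigma`, and compares the average best of `n` draws with a UCB-style score `mu + beta * sigma`, where `beta` is set to the (1 - 1/n) Gaussian quantile. The function names and parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def best_of_n_score(mu, sigma, n, rng):
    """Largest of n noisy scores drawn i.i.d. from N(mu, sigma^2)."""
    return max(rng.gauss(mu, sigma) for _ in range(n))


def ucb_score(mu, sigma, n):
    """UCB-style score mu + beta * sigma.

    Assumption for this sketch: beta is the (1 - 1/n) quantile of a
    standard normal, so the exploration bonus grows with n.
    """
    beta = NormalDist().inv_cdf(1.0 - 1.0 / n)
    return mu + beta * sigma


def compare(mu, sigma, n=1000, trials=200, seed=0):
    """Return (average Best-of-n score, UCB-style score) for one candidate."""
    rng = random.Random(seed)
    avg_best = mean(best_of_n_score(mu, sigma, n, rng) for _ in range(trials))
    return avg_best, ucb_score(mu, sigma, n)


if __name__ == "__main__":
    for mu, sigma in [(0.0, 1.0), (0.5, 0.5), (0.0, 2.0)]:
        avg_best, ucb = compare(mu, sigma)
        print(f"mu={mu} sigma={sigma} best-of-n~{avg_best:.2f} ucb~{ucb:.2f}")
```

In this toy setting a candidate with a lower mean but higher uncertainty can receive a higher Best-of-N score than a more certain one, which is the exploration behavior UCB is designed to produce.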

Paper Structure

This paper contains 52 sections, 6 theorems, 60 equations, 18 figures, 3 tables, 1 algorithm.

Key Result

Lemma 1

Let $\xi_1,\dots,\xi_N$ be i.i.d. mean-zero, unit-variance Gaussian, $M_N=\max_{i\le N}\xi_i$, $S_N$ the second largest, and $q_N:=F^{-1}(1-1/N)$ the $(1-1/N)$-quantile of $\xi_1$. Then $q_N=\Theta(\sqrt{\ln N})$, $M_N-q_N\to 0$ in probability, and $M_N-S_N=O_p(1/q_N)$.
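A quick Monte Carlo check can make the three statements of Lemma 1 concrete. The sketch below is only an illustration under assumed settings (the sample sizes, trial count, seed, and function names are choices made here, not taken from the paper): it computes $q_N$ from the standard normal quantile function, then estimates the typical size of $|M_N-q_N|$ and of the rescaled gap $q_N\,(M_N-S_N)$.

```python
import math
import random
from statistics import NormalDist, mean


def gaussian_top_two(n, rng):
    """Return (largest, second largest) of n i.i.d. standard normal draws."""
    top, second = -math.inf, -math.inf
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        if x > top:
            top, second = x, top
        elif x > second:
            second = x
    return top, second


def summarize(n, trials=200, seed=0):
    """Empirical summaries matching the quantities named in Lemma 1."""
    rng = random.Random(seed)
    q_n = NormalDist().inv_cdf(1.0 - 1.0 / n)  # the (1 - 1/N)-quantile
    draws = [gaussian_top_two(n, rng) for _ in range(trials)]
    return {
        "N": n,
        "q_N": q_n,
        # Bounded away from 0 and infinity if q_N = Theta(sqrt(ln N)).
        "q_N/sqrt(ln N)": q_n / math.sqrt(math.log(n)),
        # Expected to shrink (slowly) as N grows.
        "mean|M_N-q_N|": mean(abs(m - q_n) for m, _ in draws),
        # Expected to stay bounded if M_N - S_N = O_p(1/q_N).
        "q_N*mean(M_N-S_N)": q_n * mean(m - s for m, s in draws),
    }


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(summarize(n))
```

Because the convergence in Lemma 1 is logarithmic in $N$, the deviation $|M_N-q_N|$ decreases only slowly across these sample sizes; the simulation is meant to show the trend, not to verify the limit.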

Figures (18)

  • Figure 1: Comparison of human-driven and AI-driven ad self-improvement cycles in digital advertising.
  • Figure 2: An example of a non-parallel UCB-BO procedure for four optimization steps. At each step, we choose the maximum (or multiple local maxima in the parallel UCB-BO) of the blue UCB function as the next point to evaluate.
  • Figure 3: Gradient steps of parallel gradient-based BO (Left image) and those of T-BoN BO (Right image). T-BoN BO extends parallel gradient-based BO to the language space.
  • Figure 4: Visual summary of T-BoN BO (Algorithm 1) with $T$ iterations, $J$ trajectories, and $G$ Best-of-$N$ gradients per iteration and trajectory.
  • Figure 5: An ad generation AI system $\Phi$ and prompt $\Pi=\langle \pi\rangle$.
  • ...and 13 more figures

Theorems & Definitions (13)

  • Definition 1: Self-improving AI algorithm
  • Lemma 1: Gaussian maxima and spacing (Vershynin, 2018)
  • Theorem 2: Best-of-$N$ asymptotically induces a UCB gradient direction
  • Lemma 3: Uniform first-order expansion
  • proof
  • Lemma 4: Max-stability under bounded perturbations
  • proof
  • Lemma 5: Spherical cap coverage
  • proof
  • Lemma 6: Near-maximum coupling in a thin band
  • ...and 3 more