Meta-Prompt Optimization for LLM-Based Sequential Decision Making

Mingze Kong, Zhiyong Wang, Yao Shu, Zhongxiang Dai

TL;DR

This paper tackles automatic meta-prompt optimization for LLM-based sequential decision-making under non-stationary rewards. It introduces EXPO, an EXP3-inspired adversarial-bandit algorithm that jointly optimizes the task description $\mathcal{D}$ and the meta-instruction $\mathcal{I}$, and an extension, EXPO-ES, that also optimizes the exemplar histories $\mathcal{E}$. In experiments on Linear Regression, the Traveling Salesman Problem, and LLM-based MAB tasks, EXPO significantly improves convergence and final performance over fixed prompts and enhanced baselines, while EXPO-ES offers additional gains when the exemplars are informative. The work highlights adversarial-bandit formulations as a natural and effective framework for non-stationary prompt optimization in real-world, interactive LLM systems.

Abstract

Large language models (LLMs) have recently been employed as agents to solve sequential decision-making tasks such as Bayesian optimization and multi-armed bandits (MAB). These works usually adopt an LLM for sequential action selection by providing it with a fixed, manually designed meta-prompt. However, numerous previous works have found that the prompt has a significant impact on the performance of the LLM, which calls for a method to automatically optimize the meta-prompt for LLM-based agents. Unfortunately, the non-stationarity of the reward observations during LLM-based sequential decision-making makes meta-prompt optimization highly challenging. To address this challenge, we draw inspiration from adversarial bandit algorithms, which are inherently capable of handling non-stationary reward observations. Building on this foundation, we propose our EXPonential-weight algorithm for prompt Optimization (EXPO) to automatically optimize the task description and meta-instruction in the meta-prompt for LLM-based agents. We also extend EXPO to additionally optimize the exemplars (i.e., the history of interactions) in the meta-prompt to further enhance performance, which yields our EXPO-ES algorithm. Extensive experiments show that our algorithms significantly improve the performance of LLM-based sequential decision-making.
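
To make the adversarial-bandit idea concrete, the following is a minimal sketch of the generic EXP3 exponential-weight update, applied to a hypothetical finite set of candidate meta-prompts indexed as arms. It is not the paper's EXPO algorithm, whose details may differ. The names `exp3_probabilities`, `exp3_update`, `run_exp3`, and `reward_fn`, the exploration rate `gamma`, and the assumption that rewards are scaled to [0, 1] are all illustrative choices made here.

```python
import math
import random


def exp3_probabilities(weights, gamma):
    """Mix the normalized weights with uniform exploration of rate gamma."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3_update(weights, probs, arm, reward, gamma):
    """Update only the pulled arm, using an importance-weighted reward.

    Assumes reward is in [0, 1]. Dividing by probs[arm] keeps the reward
    estimate unbiased even though only one arm is observed per round.
    """
    k = len(weights)
    estimated_reward = reward / probs[arm]
    new_weights = list(weights)
    new_weights[arm] *= math.exp(gamma * estimated_reward / k)
    return new_weights


def run_exp3(reward_fn, num_arms, num_rounds, gamma=0.1, seed=0):
    """Run EXP3 and return the final sampling distribution over arms.

    reward_fn(t, arm) is a hypothetical callback that returns the reward in
    [0, 1] for using candidate `arm` at round `t`; it may change over time.
    """
    rng = random.Random(seed)
    weights = [1.0] * num_arms
    for t in range(num_rounds):
        probs = exp3_probabilities(weights, gamma)
        arm = rng.choices(range(num_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        weights = exp3_update(weights, probs, arm, reward, gamma)
        # Rescale so the largest weight is 1; this leaves the sampling
        # probabilities unchanged and avoids overflow in long runs.
        largest = max(weights)
        weights = [w / largest for w in weights]
    return exp3_probabilities(weights, gamma)


if __name__ == "__main__":
    # Toy non-stationary environment: arm 0 pays off in the first half,
    # arm 2 in the second half.
    def drifting_reward(t, arm):
        best = 0 if t < 250 else 2
        return 1.0 if arm == best else 0.0

    print(run_exp3(drifting_reward, num_arms=3, num_rounds=500))
```

Because the update never assumes a fixed reward distribution per arm, the same loop applies when the usefulness of a given prompt changes over the course of the interaction, which is the property the abstract appeals to.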

Paper Structure

This paper contains 36 sections, 20 equations, 19 figures, 2 algorithms.

Figures (19)

  • Figure 1: Illustration of our EXPO algorithm. We use purple to denote the task description and blue to represent the meta-instruction.
  • Figure 2: Results of different algorithms (mean $\pm$ standard error) in the Linear Regression and TSP tasks (Sec. \ref{subsec:exp:opro}). Lower is better.
  • Figure 3: The task description and meta-instruction used by OPRO (left) and optimized by our EXPO (right) in a Linear Regression task.
  • Figure 4: Cumulative regret of different algorithms in the LLM-based MAB experiments (Sec. \ref{subsec:exp:bandits}). Lower is better.
  • Figure 5: Results of our EXPO when only optimizing the task description or the meta-instruction.
  • ...and 14 more figures