Meta-Prompt Optimization for LLM-Based Sequential Decision Making
Mingze Kong, Zhiyong Wang, Yao Shu, Zhongxiang Dai
TL;DR
This paper addresses the automatic optimization of meta-prompts for LLM-based sequential decision-making under non-stationary rewards. It introduces EXPO, an EXP3-inspired adversarial-bandit algorithm that jointly optimizes the task description $\mathcal{D}$ and the meta-instruction $\mathcal{I}$, along with an extension, EXPO-ES, that also optimizes the exemplar histories $\mathcal{E}$. In experiments on Linear Regression, the Traveling Salesman Problem, and LLM-based MAB tasks, EXPO significantly improves convergence and final performance over fixed prompts and enhanced baselines, while EXPO-ES yields additional gains when the exemplars are informative. The work highlights adversarial-bandit formulations as a natural and effective framework for non-stationary prompt optimization in real-world, interactive LLM systems.
Abstract
Large language models (LLMs) have recently been employed as agents to solve sequential decision-making tasks such as Bayesian optimization and multi-armed bandits (MAB). These works usually adopt an LLM for sequential action selection by providing it with a fixed, manually designed meta-prompt. However, numerous previous works have found that the prompt has a significant impact on the performance of the LLM, which calls for a method to automatically optimize the meta-prompt for LLM-based agents. Unfortunately, the non-stationarity of the reward observations during LLM-based sequential decision-making makes meta-prompt optimization highly challenging. To address this challenge, we draw inspiration from adversarial bandit algorithms, which are inherently capable of handling non-stationary reward observations. Building on this foundation, we propose our EXPonential-weight algorithm for prompt Optimization (EXPO) to automatically optimize the task description and meta-instruction in the meta-prompt for LLM-based agents. We also extend EXPO to additionally optimize the exemplars (i.e., the history of interactions) in the meta-prompt to further enhance performance, yielding our EXPO-ES algorithm. Extensive experiments show that our algorithms significantly improve the performance of LLM-based sequential decision-making.
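To make the adversarial-bandit idea concrete, the following is a minimal sketch of the textbook EXP3 algorithm applied to a small, fixed set of candidate meta-prompts. It is not the EXPO algorithm itself, whose details are not given in this abstract. The names `exp3_over_prompts`, `toy_reward`, the candidate labels, and the exploration rate `gamma` are all hypothetical and chosen only for illustration. The toy reward changes at round 50 to mimic a non-stationary signal.

```python
import math
import random


def toy_reward(prompt, t):
    """Hypothetical non-stationary reward in [0, 1] (illustration only)."""
    if t < 50:
        return 0.1  # early rounds: every candidate looks equally poor
    return 0.9 if prompt == "detailed" else 0.1


def exp3_over_prompts(candidates, reward_fn, rounds, gamma=0.1, seed=0):
    """Textbook EXP3 over a finite set of candidate prompts.

    Returns the final normalized weights and how often each candidate was
    played. Rewards are assumed to lie in [0, 1].
    """
    rng = random.Random(seed)
    k = len(candidates)
    weights = [1.0] * k
    counts = [0] * k
    for t in range(rounds):
        total = sum(weights)
        # Mix the exponential-weight distribution with uniform exploration.
        probs = [(1 - gamma) * w / total + gamma / k for w in weights]
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(candidates[arm], t)
        # Importance-weighted estimate: only the played arm is updated.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / k)
        # Rescale by the largest weight to avoid overflow; this leaves the
        # sampling probabilities unchanged.
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        counts[arm] += 1
    total = sum(weights)
    return [w / total for w in weights], counts


if __name__ == "__main__":
    prompts = ["terse", "detailed", "verbose"]
    final_probs, plays = exp3_over_prompts(prompts, toy_reward, rounds=2000)
    for name, p, n in zip(prompts, final_probs, plays):
        print(f"{name}: weight share {p:.3f}, played {n} times")
```

Because EXP3 makes no stationarity assumption about the rewards, the same update applies before and after the reward shift; in a real system the reward function would come from the LLM agent's observed task performance rather than a hard-coded rule.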
