Efficient Prompt Optimization Through the Lens of Best Arm Identification

Chengshuai Shi, Kun Yang, Zihan Chen, Jundong Li, Jing Yang, Cong Shen

TL;DR

This work tackles prompt optimization under explicit budget constraints by reframing the problem as fixed-budget best-arm identification (BAI-FB) in multi-armed bandits (MAB). The proposed TRIPLE framework builds on two core BAI-FB designs, successive halving (SH) and continuously reject (CR), and extends them with embedding-based enhancements (TRIPLE-CLST and TRIPLE-GSE) to scale to large candidate pools. Empirical results across multiple tasks and LLMs show that TRIPLE consistently outperforms standard baselines and can be integrated into end-to-end prompt optimization pipelines (APE, APO) to boost their performance. The approach also extends to selecting few-shot examples, illustrating the framework’s versatility and its potential to influence both prompt optimization and broader MAB research on correlated arms and contextual settings.
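
The SH idea can be illustrated with a minimal sketch, assuming each candidate prompt is an arm, each call to a user-supplied `evaluate(prompt)` is one LLM query on a single task example scored in [0, 1], and `budget` is the total number of such queries. The function name, signature, and simulated pool are hypothetical, not TRIPLE's released code. The per-round split follows the standard fixed-budget successive-halving schedule: the budget is divided equally over ⌈log2 K⌉ rounds, and the worse half of the surviving prompts is discarded after each round.

```python
import math
import random
from typing import Callable, Sequence


def successive_halving(
    prompts: Sequence[str],
    evaluate: Callable[[str], float],
    budget: int,
) -> str:
    """Return the prompt judged best by fixed-budget successive halving.

    `evaluate(prompt)` is a hypothetical callback standing for one LLM query
    on one task example, returning a score in [0, 1]. `budget` is the total
    number of calls allowed; it is respected when budget >= K * ceil(log2 K).
    """
    active = list(prompts)
    if not active:
        raise ValueError("need at least one candidate prompt")
    num_rounds = max(1, math.ceil(math.log2(len(active))))
    for _ in range(num_rounds):
        if len(active) == 1:
            break
        # Equal share of the budget per round, split evenly over survivors.
        # max(1, ...) forces one pull each, which can overshoot tiny budgets.
        pulls = max(1, budget // (len(active) * num_rounds))
        means = {
            p: sum(evaluate(p) for _ in range(pulls)) / pulls for p in active
        }
        # Keep the better half (rounded up), ranked by empirical mean score.
        active.sort(key=lambda p: means[p], reverse=True)
        active = active[: math.ceil(len(active) / 2)]
    return active[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated pool: hidden accuracies stand in for real LLM evaluations.
    true_acc = {f"prompt-{i}": 0.30 + 0.05 * i for i in range(8)}

    def noisy_eval(p: str) -> float:
        # One Bernoulli "correct / incorrect" outcome per simulated query.
        return float(rng.random() < true_acc[p])

    print(successive_halving(list(true_acc), noisy_eval, budget=2400))
```

Compared with spreading the budget uniformly over all K prompts, this schedule spends most queries on the few prompts still in contention, which is where separating the best candidate from its closest competitors is hardest.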

Abstract

The remarkable instruction-following capability of large language models (LLMs) has sparked a growing interest in automatically finding good prompts, i.e., prompt optimization. Most existing works follow the scheme of selecting from a pre-generated pool of candidate prompts. However, these designs mainly focus on the generation strategy, while limited attention has been paid to the selection method. In particular, the cost incurred during selection (e.g., querying the LLM and evaluating its responses) is rarely considered explicitly. To overcome this limitation, this work provides a principled framework, TRIPLE, to efficiently perform prompt selection under an explicit budget constraint. TRIPLE is built on a novel connection established between prompt optimization and fixed-budget best arm identification (BAI-FB) in multi-armed bandits (MAB); thus, it can systematically leverage the rich toolbox of BAI-FB while also incorporating unique characteristics of prompt optimization. Extensive experiments on multiple well-adopted tasks using various LLMs demonstrate the remarkable performance improvement of TRIPLE over baselines while satisfying the budget constraint. As an extension, variants of TRIPLE are proposed to efficiently select examples for few-shot prompts, also achieving superior empirical performance.
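
The connection between prompt selection and BAI-FB can be written out as below. The notation is illustrative and assumed here, not taken from the paper: the K candidate prompts p_1, ..., p_K are the arms, f is the LLM, [p_k; x] denotes prompt p_k followed by input x, s(·,·) is a task score such as exact-match accuracy, and each round costs one LLM query.

```latex
\begin{aligned}
&\mu_k = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, s\big(f([p_k; x]),\, y\big) \big],
  \qquad k^\ast = \arg\max_{k \in [K]} \mu_k, \\
&\text{for } t = 1,\dots,N:\ \text{choose } k_t \in [K],\ \text{draw } (x_t, y_t) \sim \mathcal{D},\
  \text{observe } r_t = s\big(f([p_{k_t}; x_t]),\, y_t\big), \\
&\text{after } N \text{ rounds, output } \hat{k} \text{ so as to minimize } \Pr\big(\hat{k} \neq k^\ast\big).
\end{aligned}
```

Under this view, a uniform allocation spends about N/K queries on each prompt, whereas BAI-FB algorithms reallocate the same N queries adaptively toward the prompts that are hardest to tell apart.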

Paper Structure (33 sections, 1 equation, 16 figures, 7 tables, 6 algorithms)

Figures (16)

  • Figure 1: The commonly adopted prompt optimization pipeline. Previous works mostly investigate the generation component and ignore costs during selection; GrIPS and APE are proposed in Prasad et al. (2022) and Zhou et al. (2022), respectively. This work instead focuses on the selection component under an explicit budget constraint.
  • Figure 2: Performance comparisons of various prompt selection methods on the selected tasks. The reported results are aggregated over 20 independent runs. The full results on 47 tasks are reported in the appendix.
  • Figure 3: Relative gain of over Uniform under different budgets, collected with GPT-3.5.
  • Figure 4: The adopted system instructions: GPT-3.5 (left) and Llama2/Gemma/Mistral (right).
  • Figure 5: The adopted prompt generation templates for experiments with APE: forward (left) and backward (right).
  • ...and 11 more figures

Theorems & Definitions (2)

  • Remark 3.1
  • Definition E.1: f1-score