Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Sercan O. Arik

TL;DR

Intelligently reusing model-generated input-output pairs, obtained from evaluating prompts on the validation set, as exemplars consistently improves performance on top of instruction optimization (IO) methods, yet this approach is currently under-investigated. The paper also finds a synergy between exemplar optimization (EO) and IO, with optimal combinations surpassing the individual contributions.

Abstract

Large language models have demonstrated remarkable capabilities, but their performance is heavily reliant on effective prompt engineering. Automatic prompt optimization (APO) methods are designed to automate this and can be broadly categorized into those targeting instructions (instruction optimization, IO) vs. those targeting exemplars (exemplar optimization, EO). Despite their shared objective, these have evolved rather independently, with IO receiving more research attention recently. This paper seeks to bridge this gap by comprehensively comparing the performance of representative IO and EO techniques, both in isolation and in combination, on a diverse set of challenging tasks. Our findings reveal that intelligently reusing model-generated input-output pairs obtained from evaluating prompts on the validation set as exemplars consistently improves performance on top of IO methods but is currently under-investigated. We also find that despite the recent focus on IO, how we select exemplars can outweigh how we optimize instructions, with EO strategies as simple as random search outperforming state-of-the-art IO methods with seed instructions without any optimization. Moreover, we observe a synergy between EO and IO, with optimal combinations surpassing the individual contributions. We conclude that studying exemplar optimization, both as a standalone method and in its optimal combination with instruction optimization, remains a crucial aspect of APO and deserves greater consideration in future research, even in the era of highly capable instruction-following models.
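
To make the abstract's two ideas concrete (reusing model-generated validation pairs as exemplars, and EO via plain random search), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `llm` callable, the `Q:`/`A:` prompt template, exact-match scoring, and every function name below are hypothetical choices made for this example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]        # (input, output)
LLM = Callable[[str], str]    # hypothetical: maps a prompt string to a response


def build_prompt(instruction: str, exemplars: Sequence[Pair], query: str) -> str:
    """Prepend the instruction and the exemplars to the query (assumed template)."""
    parts = [instruction]
    parts += [f"Q: {x}\nA: {y}" for x, y in exemplars]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def accuracy(llm: LLM, instruction: str, exemplars: Sequence[Pair],
             data: Sequence[Pair]) -> float:
    """Exact-match accuracy of the prompt on labelled data."""
    if not data:
        return 0.0
    hits = sum(llm(build_prompt(instruction, exemplars, x)).strip() == y
               for x, y in data)
    return hits / len(data)


def collect_candidates(llm: LLM, instruction: str,
                       val_data: Sequence[Pair]) -> List[Pair]:
    """Keep the model's own (input, output) pairs that were correct on validation."""
    pool = []
    for x, y in val_data:
        out = llm(build_prompt(instruction, [], x)).strip()
        if out == y:
            pool.append((x, out))
    return pool


def random_search_eo(llm: LLM, instruction: str, val_data: Sequence[Pair],
                     k: int = 2, n_trials: int = 8,
                     seed: int = 0) -> Tuple[List[Pair], float]:
    """Random-search EO: sample exemplar sets and keep the best on validation.

    Simplification (an assumption of this sketch): candidate sets are scored
    on the full validation set, including the items used as exemplars.
    """
    rng = random.Random(seed)
    pool = collect_candidates(llm, instruction, val_data)
    best: List[Pair] = []
    best_acc = accuracy(llm, instruction, best, val_data)  # zero-shot baseline
    if not pool:
        return best, best_acc
    for _ in range(n_trials):
        candidate = rng.sample(pool, min(k, len(pool)))
        acc = accuracy(llm, instruction, candidate, val_data)
        if acc > best_acc:
            best, best_acc = candidate, acc
    return best, best_acc
```

The instruction is held fixed here, which corresponds to the "seed instruction without any optimization" setting described in the abstract; combining EO with IO would mean also searching over `instruction`.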

Paper Structure

This paper contains 26 sections, 2 equations, 20 figures, 23 tables, and 1 algorithm.

Figures (20)

  • Figure 1: Average performance over >20 tasks on PaLM 2. We compare and combine APO methods targeting exemplars and instructions, and find that how we optimize exemplars (orange) can eclipse how we optimize instructions (blue and purple), despite current research favoring the latter, whereas optimizing both (cyan) is best within a similar budget.
  • Figure 2: An example prompt: the instruction $I$ describes the task; exemplars ($e_1, \ldots, e_k$; $k=1$ in the figure) provide demonstrations and enable ICL; both are prepended to the query $x$ before receiving the LLM responses.
  • Figure 3: Appropriate EO improves over any or no IO: Task-specific BBH performance with no instruction optimization (left) and with SoTA IO, APE (middle) and ProTeGi (right), before and after applying exemplars found via Mutation (see the experimental setup section) on PaLM 2. Dashed and solid lines denote the average performance before and after exemplars, respectively. Task index is determined by the ascending order of test accuracy under the seed instruction. Refer to the additional visualizations in the appendix.
  • Figure 4: Optimized exemplars generalize better than optimized instructions. Comparison of validation accuracy and test accuracy over different model-task combinations. The generalization gap, which is the difference between validation and test accuracy, is marked on each figure. The better generalization of EO is exemplified by the smaller generalization gaps in all cases studied.
  • Figure 5: Task-specific BBH performance of selected IO-EO combinations with PaLM 2 (refer to the additional visualizations in the appendix for all other models). Note that 1) proper EO almost uniformly improves performance, and 2) with appropriate exemplars, seed instructions with no optimization (third bar from the right) can often perform on par with or better than SoTA IO that uses the standard random exemplars or no exemplars commonly used in the literature (first six bars in each figure).
  • ...and 15 more figures