Why is prompting hard? Understanding prompts on binary sequence predictors

Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein

TL;DR

This work views prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. It shows that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice.

Abstract

Large language models (LLMs) can be prompted to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We explore these issues by viewing prompting as conditioning a near-optimal sequence predictor (LLM) pretrained on diverse data sources. Through numerous prompt search experiments, we show that the unintuitive patterns in optimal prompts can be better understood given the pretraining distribution, which is often unavailable in practice. Moreover, even using exhaustive search, reliably identifying optimal prompts from practical neural predictors can be difficult. Further, we demonstrate that common prompting methods, such as using intuitive prompts or samples from the targeted task, are in fact suboptimal. Thus, this work takes an initial step towards understanding the difficulties in finding and understanding optimal prompts from a statistical and empirical perspective.
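
To make "prompting as conditioning" concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a pretraining distribution that is an equal-weight mixture of two Bernoulli sources with biases 0.2 and 0.7, a task that is Bernoulli with bias 0.7, and a simplified objective: the expected log-loss on a single next bit. The function names, the equal mixture weights, and the single-bit objective are all illustrative assumptions.

```python
import itertools
import math


def bayes_predict_one(prompt, thetas=(0.2, 0.7), prior=(0.5, 0.5)):
    """P(next bit = 1 | prompt) for a Bayes predictor over a Bernoulli mixture.

    `thetas` and `prior` are assumed values chosen for illustration.
    """
    ones = sum(prompt)
    zeros = len(prompt) - ones
    # Unnormalised posterior weight of each mixture component given the prompt.
    post = [w * th**ones * (1 - th) ** zeros for w, th in zip(prior, thetas)]
    total = sum(post)
    # Posterior-weighted average of the component biases.
    return sum(p / total * th for p, th in zip(post, thetas))


def expected_log_loss(prompt, task_bias):
    """Expected log-loss on one bit drawn from Bernoulli(task_bias)."""
    p1 = bayes_predict_one(prompt)
    return -(task_bias * math.log(p1) + (1 - task_bias) * math.log(1 - p1))


def best_prompt(task_bias, max_len):
    """Exhaustive search over all binary prompts of length 0..max_len."""
    candidates = (
        s
        for n in range(max_len + 1)
        for s in itertools.product((0, 1), repeat=n)
    )
    return min(candidates, key=lambda s: expected_log_loss(s, task_bias))


best = best_prompt(task_bias=0.7, max_len=5)
print(best, bayes_predict_one(best))
```

Under these assumptions the search returns the all-ones prompt of maximum length, not a prompt with roughly 70% ones: all-ones is the fastest way to push the posterior onto the 0.7 component. This illustrates why an optimal prompt need not look like a typical sample from the task.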

Paper Structure

This paper contains 68 sections, 1 theorem, 28 equations, 28 figures, 4 tables, 1 algorithm.

Key Result

Proposition 1.2

The deterministic prompt distribution $\nu({\mathbf{s}}|\tau)=\delta_{{\mathbf{s}}^*(\tau)}$ centered at ${\mathbf{s}}^*(\tau)$ for all $\tau$ maximizes $\operatorname{MAMI}({\mathbf{s}};{\mathbf{x}})$.

Figures (28)

  • Figure 1: Experimental design to obtain optimal prompts under pretraining and task data generators.
  • Figure 2: Results for a pretraining DG $p=\operatorname{BernMix}(0.2, 0.7)$ and a task DG $q=\operatorname{Bern}(0.7)$. Left, the proportion correct for the Bayes predictor ($10^3$ seeds per data point) and two neural predictors (30 seeds per data point). Error bars show 1 SEM. The black dotted line is the theoretical value for $T=1$ (see \ref{sec:bernmix_bern_0.7_nonmonotonic}). Additional results are in \ref{sec:bernmix_bern_0.7_correct}. Right, empirically optimal prompts at $L_{\text{max}}=5$ for the Bayes predictor for different values of $T$ (colors) and $N$ (panels); 100 repetitions per setting. The counts of zeros and ones are jittered. The cyan cross shows the all-one theoretical $s^*$.
  • Figure 3: Results for $p=\operatorname{BernMix}(0.2, 0.7)$ and $q=\operatorname{Bern}(0.6)$. Left, each circle represents the heads/tails count of a theoretically optimal prompt. The orange dotted line indicates the maximum prompt length $L_{\text{max}}$. The green dashed line marks 60% ones in $s^*$. Right, the proportion correct of $\hat{s}=s^*$. \ref{fig:sensitivity_0.6_supp} shows additional results.
  • Figure 4: Results for $p=\operatorname{BetaBern}(1, \beta)$ with $\beta\in\{1, 2\}$, and $q=\operatorname{Bern}(\tau)$ with $\tau\in\{0.7, 0.9\}$. Left, the ratio of ones in the theoretical $s^*$. The red dotted line shows the true bias of $q$. Right, the proportion correct of $\hat{s}$ for $\beta=1$. \ref{fig:beta_categorical_supp} has more results.
  • Figure 5: Results for a random switching pretraining DG and four switching downstream DGs with different causes $(\varepsilon, \lambda)$ (rows). Left two columns, example theoretical prompts $s^*$ (dots), the true switching latent bias $y$ (orange solid), and heuristic estimates of the latent $y$ based on $s^*$. Middle three columns, the proportion correct of $\hat{s}$; \ref{fig:switching_supp} shows additional results. Right two columns, estimated log-loss of prompting the Bayes predictor using typical prompts from $q$ of increasing length (blue line), compared to using the theoretical $s^*$ with length $15$ (red cross and dotted line).
  • ...and 23 more figures

Theorems & Definitions (8)

  • Definition 3.1: Bernoulli
  • Definition 3.2: Bernoulli mixture
  • Definition 3.3: Beta-Bernoulli
  • Definition 5.1: Switching Process
  • Definition 5.2: Random Switching Process
  • Definition 1.1
  • Proposition 1.2
  • proof