Prompt Exploration with Prompt Regression

Michael Feffer, Ronald Xu, Yuekai Sun, Mikhail Yurochkin

TL;DR

This work addresses the challenge of systematizing prompt construction for large language models by introducing Prompt Exploration with Prompt Regression (PEPR). PEPR breaks prompt design into a regression step that predicts the effect of individual library elements on outputs and a subsequent selection step that assembles an effective prompt under a fixed budget, using either log-probability data or human preference signals. The approach relies on an independence of irrelevant alternatives assumption to keep the regression tractable, enabling efficient extrapolation from K observed prompts to 2^K−1 possible combinations, with two formulations: PEPR-R (log-probability) and PEPR-P (preference-based). Across multiple open-source LLMs and datasets (Toy, HateCheck, CAMEL, Natural Instructions) and several model sizes, PEPR-tuned prompts frequently outperform baselines and approach or reach the best possible configurations under limited evaluation budgets, though some libraries show that random selection can occasionally beat model-guided prompts. The work highlights PEPR’s potential to reduce brute-force search in prompt engineering, while noting limitations related to library quality and the linearity assumption, and it points to future directions including richer features, nonlinear models, and broader prompt components. The framework has practical implications for safer, more reliable, and scalable prompt optimization in real-world LLM deployments.

Abstract

With the advent of democratized usage of large language models (LLMs), there is a growing desire to systematize LLM prompt creation and selection beyond iterative trial and error. Prior work mostly focuses on searching the space of prompts without accounting for relations between prompt variations. Here we propose a framework, Prompt Exploration with Prompt Regression (PEPR), to predict the effect of prompt combinations given results for individual prompt elements, as well as a simple method to select an effective prompt for a given use case. We evaluate our approach with open-source LLMs of different sizes on several different tasks.
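
The extrapolation mentioned in the TL;DR, from K library elements to 2^K−1 non-empty combinations, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, purely for illustration, that the log-probability of a reference output under a combined prompt is a weighted average of the per-element log-probabilities, with weights renormalized over the chosen subset as one reading of the independence of irrelevant alternatives assumption. The names `nonempty_subsets` and `predict_logprob`, the weights, and the log-probability values are all hypothetical.

```python
from itertools import combinations


def nonempty_subsets(k):
    """Yield every non-empty subset of element indices 0..k-1 as a tuple."""
    for size in range(1, k + 1):
        yield from combinations(range(k), size)


def predict_logprob(subset, weights, element_logprobs):
    """Predict the log-probability under a combined prompt.

    Assumed (hypothetical) form: a weighted average of the per-element
    log-probabilities, with the weights renormalized over the subset.
    """
    total = sum(weights[i] for i in subset)
    return sum(weights[i] / total * element_logprobs[i] for i in subset)


# Made-up values for a library of K = 4 prompt elements.
K = 4
weights = [0.4, 0.3, 0.2, 0.1]               # stand-in for fitted regression weights
element_logprobs = [-1.0, -2.0, -0.5, -3.0]  # stand-in for observed single-element results

subsets = list(nonempty_subsets(K))
predictions = {s: predict_logprob(s, weights, element_logprobs) for s in subsets}

print(len(subsets))          # number of non-empty combinations, 2**K - 1
print(predictions[(0, 1)])   # predicted value for the prompt built from elements 0 and 1
```

Only K single-element results are observed here, yet the model yields a prediction for every one of the 2^K−1 combinations. In practice the weights would be fitted by regression against observed outputs rather than set by hand.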

Paper Structure (33 sections, 12 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Overview of PEPR utilization. After building a prompt library, the prompt regression step of PEPR uses the prompt library in conjunction with reference text or preference information to determine the influence of each prompt element on overall output. The second step, prompt selection, uses the parameters derived from prompt regression and data corresponding to desired behavior to select prompt elements in line with this behavior. Finally, the overall prompt is recovered from prompt selection.
  • Figure 2: Predicted-versus-true value plots for prompt regression experiments on two datasets, NI Task 195 and HateCheck. These illustrate the ability of our prompt regression model to predict LLM outputs when prompt elements and their corresponding coefficients are marginalized out of the model (e.g., for a regression model reflecting the effects of 10 prompts, we aim to illustrate how its predictions for the prompt with elements 2 and 5 compare to ground-truth outputs). While there are no clear trends across model size, PEPR-P performs as well as or better than PEPR-R in most cases. Additionally, both versions of PEPR generally have low error and high correlation, suggesting that our assumption of the independence of irrelevant alternatives typically holds.
  • Figure 3: Prompt library for the Toy Dataset experiments. Note that the text "You are an AI assistant. Strictly adhere to the following rules:" was prepended to any chosen subset of prompts. (The second sentence was omitted for the unprompted baseline, but the first was used without any prompt elements.)
  • Figure 4: Prompt library for the HateCheck experiments. Note that the text "You are a hate speech detector. Given a piece of text, respond with "hateful" if it is offensive and "non-hate" if it is ok." was prepended to any chosen subset of prompts.
  • Figure 5: Prompt library for the Biology and Physics experiments.
  • ...and 5 more figures