PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Heuristic-based Sampling

Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, Chuchu Fan

TL;DR

PROMST addresses the challenge of optimizing prompts for multi-step, LLM-driven tasks by integrating human-designed feedback rules with a learnable score-prediction model to guide offline prompt generation. The framework alternates between TaskLLM-driven task execution and PromptLLM-driven prompt generation, using summarized feedback to produce candidates and a score model to prune low-potential prompts. Across 11 diverse environments and five LLMs, PROMST achieves consistent improvements over strong baselines, with ablations highlighting the importance of human feedback, SumLLM, and the score predictor for efficiency and alignment. The work provides a benchmark-ready approach and resources that can catalyze automatic prompt optimization for complex, real-world, multi-step tasks.

Abstract

Prompt optimization aims to find the best prompt for a large language model (LLM) on a given task. LLMs have been successfully used to help find and improve prompt candidates for single-step tasks. However, realistic tasks for agents are multi-step and introduce new challenges: (1) prompt content is likely to be more extensive and complex, making it more difficult for LLMs to analyze errors, (2) the impact of an individual step is difficult to evaluate, and (3) different people may have varied preferences about task execution. While humans struggle to optimize prompts, they are good at providing feedback about LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization framework, PRompt Optimization in Multi-Step Tasks (PROMST), that incorporates human-designed feedback rules to automatically offer direct suggestions for improvement. We also use an extra learned heuristic model that predicts prompt performance to efficiently sample from prompt candidates. This approach significantly outperforms both human-engineered prompts and several other prompt optimization methods across 11 representative multi-step tasks (an average 10.6%-29.3% improvement over the current best methods across five LLMs). We believe our work can serve as a benchmark for automatic prompt optimization for LLM-driven multi-step tasks. Datasets and code are available at https://github.com/yongchao98/PROMST. The project page is available at https://yongchao98.github.io/MIT-REALM-PROMST.
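
To make the heuristic-based sampling idea concrete, the following is a minimal sketch of how a score predictor could be used to prune prompt candidates before they are evaluated on the task. It is not the authors' implementation: the function name `select_candidates`, the top-k selection rule, and the toy word-count predictor are all assumptions made for illustration. In the paper the predictor is a learned model.

```python
from typing import Callable, List, Tuple


def select_candidates(
    candidates: List[str],
    predict_score: Callable[[str], float],
    keep: int,
) -> List[Tuple[str, float]]:
    """Rank candidates by predicted score and keep the top `keep`.

    `predict_score` stands in for a learned score-prediction model;
    top-k selection is an assumed rule, chosen for simplicity.
    """
    scored = [(prompt, predict_score(prompt)) for prompt in candidates]
    # Highest predicted score first; sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def toy_predictor(prompt: str) -> float:
    """Hypothetical predictor: longer prompts score higher, capped at 1.0."""
    return min(len(prompt.split()) / 10, 1.0)


candidates = [
    "Go.",
    "Move the box.",
    "Move the box to the target and check for collisions first.",
]
selected = select_candidates(candidates, toy_predictor, keep=2)
```

Only the selected prompts would then be run on the task, which is where the efficiency gain comes from: each avoided evaluation saves a full multi-step rollout.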

Paper Structure (25 sections, 6 equations, 14 figures, 10 tables, 3 algorithms)

Figures (14)

  • Figure 1: The PROMST framework. Given an initial human-designed prompt and the state of the environment for the current task, the TaskLLM iteratively generates an action and executes it until either an error occurs or the task is complete. Human-designed feedback rules automatically generate feedback about errors that is then provided as context to the PromptLLM when generating new prompt candidates. The task performance is scored according to a human-designed score function; this score can be used with the prompt to train a score prediction model online. Given new prompt candidates, this score prediction model is used to select a subset of candidates to evaluate for the next generation.
  • Figure 2: Eight examples of human-designed feedback templates. The blue-colored text represents the content specific to each instance of an error.
  • Figure 3: An illustration of the 11 environments used for multi-step task evaluation. See the appendix on testing multi-step environments for more details.
  • Figure 4: Several results inspecting the learned score prediction model. (a) The distribution of prompt scores with and without the score prediction model. (b) The prediction error of the model on the training data and held-out test data as the amount of training data increases. (c) A plot of the predicted score vs. the actual score for various prompts; blue points mark the prompts that were chosen as parents for new candidates. (d) The score of the best-performing prompt over optimization iterations, with and without the learned score prediction model.
  • Figure 5: (a) Comparison of score prediction errors for few-shot GPT-4 vs. fine-tuned Longformer as the number of few-shot examples or the amount of training data increases, respectively. (b) An ablation study of the impact of the human-designed feedback rules on task performance for four multi-step tasks.
  • ...and 9 more figures
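
The loop described in the Figure 1 caption and the templates described in the Figure 2 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the released PROMST code: the template wording, the names `render_feedback` and `optimize`, and the callables `run_task`, `summarize`, `propose`, and `predict_score` are hypothetical stand-ins for the TaskLLM rollout, SumLLM, PromptLLM, and score model. Online training of the score model is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical feedback templates; the {slots} hold instance-specific content.
FEEDBACK_TEMPLATES = {
    "syntax_error": "Your output at step {step} could not be parsed: {detail}",
    "invalid_action": (
        "The action '{action}' at step {step} is not allowed in the current state."
    ),
}


def render_feedback(kind: str, **slots) -> str:
    """Fill a human-designed template with the details of one error."""
    return FEEDBACK_TEMPLATES[kind].format(**slots)


def optimize(
    initial_prompt: str,
    run_task: Callable[[str], Tuple[float, List[str]]],  # prompt -> (score, feedback)
    summarize: Callable[[List[str]], str],                # feedback -> summary
    propose: Callable[[str, str], List[str]],             # (prompt, summary) -> candidates
    predict_score: Callable[[str], float],                # assumed score model
    generations: int,
    keep: int,
) -> Tuple[str, float]:
    """Alternate task execution and prompt generation; return the best prompt."""
    best_score, feedback = run_task(initial_prompt)
    best_prompt = initial_prompt
    frontier = [(initial_prompt, feedback)]
    for _ in range(generations):
        # Generate new candidates from each parent, guided by summarized feedback.
        candidates: List[str] = []
        for parent, parent_feedback in frontier:
            candidates.extend(propose(parent, summarize(parent_feedback)))
        # Prune with the score model before paying for task rollouts.
        candidates.sort(key=predict_score, reverse=True)
        frontier = []
        for candidate in candidates[:keep]:
            score, candidate_feedback = run_task(candidate)
            frontier.append((candidate, candidate_feedback))
            if score > best_score:
                best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

In a real setup, `run_task` would execute the TaskLLM in the environment until an error occurs or the task completes, and the feedback strings would come from rules such as `render_feedback("invalid_action", action=..., step=...)`.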