RePrompt: Planning by Automatic Prompt Engineering for Large Language Models Agents

Weizhe Chen, Sven Koenig, Bistra Dilkina

TL;DR

The paper addresses how prompt design critically shapes LLM agents' reasoning, especially in settings where a ground-truth final evaluator is costly or unavailable. It proposes RePrompt, a history-aware, gradient-descent-like automatic prompt optimizer that iteratively refines step-by-step instructions without requiring a ground-truth checker. Across PDDL generation, TravelPlanner, and Meeting Planning, RePrompt demonstrates improved performance, better handling of budget and commonsense constraints, and robustness to different feedback types. This approach enables more reliable LLM-agent behavior in practical scenarios where an expensive or missing final evaluation limits the effectiveness of traditional APE methods.

Abstract

In the past year, large language models (LLMs) have had remarkable success in domains outside traditional natural language processing, and their capabilities expand further when they are connected to external tools as so-called LLM agents. In all of these domains, the prompt has been shown to make a big difference in what the LLM generates and thus affects the performance of LLM agents. Automatic prompt engineering (APE) has therefore become an important question for many researchers and users of LLMs. However, previous work in APE relies on a final checker to evaluate the performance of a given prompt, a requirement that is hard to meet for LLM agents, where intermediate feedback is easier to obtain and the final evaluation can be expensive, inaccurate, or even missing. In this paper, we propose a novel method, RePrompt, which takes a "gradient descent"-like approach to optimizing the step-by-step instructions in the prompts given to LLM agents, based on the chat history obtained from interactions and reflections with LLM agents. By leveraging intermediate feedback, RePrompt can optimize the prompt without the need for a final solution checker. We evaluate our approach on PDDL generation, TravelPlanner, and Meeting Planning and show that our method generally improves performance across different reasoning tasks.
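
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the signatures, and the toy stand-ins for the LLM calls are all assumptions made for illustration. It only shows the general shape of the idea, in which chat histories containing intermediate feedback are summarized into a textual "loss" and an optimizer step then rewrites the step-by-step prompt, with no final solution checker involved.

```python
from typing import Callable, List

# Each "LLM" is modeled as a plain function; these signatures are assumptions.
Agent = Callable[[str, str], str]        # (prompt, task) -> chat history with feedback
Summarizer = Callable[[List[str]], str]  # chat histories -> textual "loss" ("" if none)
Optimizer = Callable[[str, str], str]    # (prompt, loss) -> revised prompt


def reprompt_loop(prompt: str, tasks: List[str], agent: Agent,
                  summarizer: Summarizer, optimizer: Optimizer,
                  epochs: int = 3) -> str:
    """Iteratively refine a prompt using only intermediate feedback."""
    for _ in range(epochs):
        # 1. Run the agent with the current prompt and keep the chat histories.
        histories = [agent(prompt, task) for task in tasks]
        # 2. Summarize recurring problems into a textual "loss".
        loss = summarizer(histories)
        if not loss:
            break  # nothing left to fix according to the summarizer
        # 3. Let the optimizer rewrite the step-by-step instructions.
        prompt = optimizer(prompt, loss)
    return prompt


# Toy stand-ins for the three LLM roles (hypothetical, deterministic).
def toy_agent(prompt: str, task: str) -> str:
    if "budget" in prompt:
        return f"{task}: plan accepted"
    return f"{task}: feedback: plan is over budget"


def toy_summarizer(histories: List[str]) -> str:
    if any("over budget" in h for h in histories):
        return "Add a step that checks the total cost against the budget."
    return ""


def toy_optimizer(prompt: str, loss: str) -> str:
    # Append the summarized fix as one more instruction step.
    return prompt + "\n- " + loss
```

Calling `reprompt_loop("Plan the trip step by step.", ["trip A", "trip B"], toy_agent, toy_summarizer, toy_optimizer)` with these toy functions adds the budget-checking step once and then stops, because the second round of histories contains no further feedback. In a real setting each of the three callables would wrap an LLM call.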

Paper Structure (23 sections, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 1: The workflow of our method RePrompt.
  • Figure 2: An example output of the optimizer LLM that produces a prompt template instead of a complete prompt. Although the output is technically correct and adds the additional instruction (shown in blue), it is not acceptable: it contains a placeholder for examples (marked in red) and would still need post-processing with the original prompt to yield a complete prompt. For simplicity, the unchanged parts of the original prompt (marked in green) are omitted; they can be found in the original paper (Guan et al., 2023).
  • Figure 3: The prompt used to convert the zero-shot output of DeepSeek-R1 into the desired format. The prompt is fed into an LLM (DeepSeek-V3 in the experiments) to transform the output into the format required by Natural Plan.
  • Figure 4: The checker prompt for PromptAgent. The prompt is fed into an LLM (GPT-4 Turbo in this paper) to judge whether the current result is correct or incorrect.
  • Figure 5: The loss summarization prompt. The prompt is fed into an LLM (GPT-4 Turbo in this paper) to obtain the loss used to optimize the prompt.
  • ...and 7 more figures