
ProRefine: Inference-Time Prompt Refinement with Textual Feedback

Deepak Pandita, Tharindu Cyril Weerasooriya, Ankit Parag Shah, Isabelle Diana May-Xin Ng, Christopher M. Homan, Wei Wei

TL;DR

ProRefine introduces an inference-time prompt-refinement loop that uses textual feedback from LLMs to dynamically adjust prompts in agentic workflows, improving multi-step reasoning without training data or fine-tuning. The method uses three roles ($LLM_{task}$, $LLM_{feedback}$, and $LLM_{optimizer}$) to generate, critique, and refine prompts in a bounded loop that terminates at a step limit or when an end-of-sequence (EOS) token is generated. Evaluations on five mathematical reasoning datasets show that ProRefine substantially outperforms zero-shot CoT and, in many cases, TextGrad, with larger gains for bigger models and when a high-quality verifier is used. By enabling smaller models to approach the performance of larger ones, the approach reduces reliance on large-scale model deployment and offers a practical path to cost-effective, hybrid AI systems. Limitations include inference-time cost, sensitivity to hyperparameters, and dependence on verifier accuracy; these motivate future work on convergence, adaptive hyperparameters, and domain generalization.

Abstract

Agentic workflows, where multiple AI agents collaborate to accomplish complex tasks like reasoning or planning, play a substantial role in many cutting-edge commercial applications, and continue to fascinate researchers across fields for their potential to accomplish expensive, complex tasks that, until recently, only humans have been trusted to do. These workflows depend critically on the prompts that define the roles the models play in them. Poorly designed prompts that fail even slightly to guide individual agents can lead to sub-optimal performance that may snowball within a system of agents, limiting their reliability and scalability. To address this problem, we introduce ProRefine, an inference-time prompt optimization method that uses an agentic loop of LLMs to generate and apply textual feedback. ProRefine dynamically refines prompts for multi-step reasoning tasks without additional training or ground-truth labels. Evaluated on five benchmark mathematical reasoning datasets, ProRefine significantly surpasses zero-shot Chain-of-Thought baselines by 3 to 37 percentage points. This approach not only boosts accuracy but also allows smaller models to approach the performance of their larger counterparts, highlighting its potential for building more cost-effective and powerful hybrid AI systems and thereby democratizing access to high-performing AI.


Paper Structure

This paper contains 32 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of ProRefine system, illustrating the iterative process of prompt optimization using feedback from LLMs. In each iteration, $LLM_{task}$ extends its output by an additional $k$ tokens, enabling step-by-step feedback to progressively refine the prompt with $LLM_{optimizer}$.
  • Figure 2: Test Accuracy [with 95% confidence interval] across different models and datasets. Llama3.1-70B-instruct is employed for feedback generation, prompt optimization, and evaluation.
  • Figure 3: Average number of prompt refinement iterations.
  • Figure 4: ProRefine example. Given an input query (whose correct answer in this case is 13) and an initial prompt, the task model ($LLM_{task}$) gives an incorrect answer. ProRefine uses two additional models, $LLM_{feedback}$ and $LLM_{optimizer}$, to iteratively improve the prompt as $LLM_{task}$ generates its response. Refining the prompt during generation allows the feedback model to target local regions of the response, providing finer-grained feedback than waiting for the response to complete. An additional example illustrating the approach is given in Figure 5.
  • Figure 5: An instance in which $LLM_{optimizer}$ is not aligned with the feedback from $LLM_{feedback}$ and misses important guiding steps. The framework is the same as in Figure 4, where $LLM_{optimizer}$ conveys the feedback effectively. We have observed a few failed instances that follow this pattern.
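
The loop described in the TL;DR and in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the three models are stand-in Python callables, the names `prorefine_sketch`, `toy_task`, `toy_feedback`, `toy_optimizer`, the `EOS` marker, and the defaults for `k` and `max_steps` are all hypothetical, and the exact ordering of the feedback, refinement, and termination checks in the paper may differ.

```python
from typing import Callable, Tuple

EOS = "<eos>"  # assumed end-of-sequence marker, for illustration only

# Hypothetical interfaces: each "LLM" is modelled as a plain function.
TaskLLM = Callable[[str, str, int], str]   # (prompt, query, max_tokens) -> output
FeedbackLLM = Callable[[str, str], str]    # (query, partial_output) -> critique
OptimizerLLM = Callable[[str, str], str]   # (prompt, critique) -> refined prompt


def prorefine_sketch(query: str, prompt: str, llm_task: TaskLLM,
                     llm_feedback: FeedbackLLM, llm_optimizer: OptimizerLLM,
                     k: int = 10, max_steps: int = 5) -> Tuple[str, int]:
    """Refine `prompt` for one query; return (refined prompt, steps used)."""
    steps_used = 0
    for step in range(1, max_steps + 1):
        steps_used = step
        # The task model may produce k more tokens than in the previous step.
        output = llm_task(prompt, query, step * k)
        if output.endswith(EOS):
            break  # the response is complete, so stop refining
        critique = llm_feedback(query, output)    # textual feedback on the partial output
        prompt = llm_optimizer(prompt, critique)  # rewrite the prompt using the feedback
    return prompt, steps_used


# Deterministic toy stand-ins so the sketch runs without any model.
def toy_task(prompt: str, query: str, max_tokens: int) -> str:
    style = "careful" if "Hint:" in prompt else "rushed"
    words = f"{style} answer to {query}".split()
    if max_tokens >= len(words):
        return " ".join(words) + " " + EOS
    return " ".join(words[:max_tokens])


def toy_feedback(query: str, partial_output: str) -> str:
    return "be careful"


def toy_optimizer(prompt: str, critique: str) -> str:
    hint = " Hint: " + critique
    return prompt if hint in prompt else prompt + hint


refined, steps = prorefine_sketch("2 + 2", "Solve the problem.",
                                  toy_task, toy_feedback, toy_optimizer, k=2)
print(refined, steps)
```

In a real setting the three callables would wrap model calls, and the refined prompt would then be used by the task model to produce the final answer. Because feedback is requested on partial outputs, the per-query cost grows with the number of steps, which is the inference-time cost listed among the limitations.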