Prompt reinforcing for long-term planning of large language models

Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić

TL;DR

This work tackles the difficulty of long-horizon planning in multi-turn interactions by introducing Reinforced Prompt Optimisation (RPO), a reinforcement-learning-inspired, parameter-free framework that iteratively updates the system instruction prompt using turn-level textual feedback and experience replay. By separating feedback generation (TD-style or Monte Carlo) from prompt rewriting, RPO enables LLM-based systems to improve strategic planning across tasks like Text-to-SQL, task-oriented dialogue, and medical QA, while staying compatible with both open-source and closed-source backbones. Empirical results across five LLMs show that TD-based feedback with replay yields substantial gains over baselines, though a gap remains relative to fully specified prompts, underscoring both the method’s practicality and its remaining challenges for unseen domains. The approach reduces inference overhead by avoiding frequent model parameter updates and demonstrates robust generalisability, opening avenues for reinforcement-learning-inspired prompt optimisation in real-time LLM applications.

Abstract

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior work on dialogue systems has shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning by modifying only the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.
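
The optimisation cycle described in the abstract (interact, generate feedback, rewrite the prompt, reuse earlier feedback through experience replay) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `rpo_loop`, `interact`, `feedbacker`, and `rewriter`, the uniform sampling from the replay buffer, and all default values are assumptions made for the example, and the toy callables stand in for real LLM calls.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Tuple

# One trajectory = list of (user utterance, system response) turns.
Trajectory = List[Tuple[str, str]]


def rpo_loop(
    initial_prompt: str,
    interact: Callable[[str], Trajectory],
    feedbacker: Callable[[Trajectory], str],
    rewriter: Callable[[str, List[str]], str],
    epochs: int = 3,
    dialogues_per_epoch: int = 2,
    replay_size: int = 4,
    buffer_capacity: int = 100,
    seed: int = 0,
) -> str:
    """Hypothetical sketch of a prompt-only optimisation loop with replay."""
    rng = random.Random(seed)
    replay: Deque[str] = deque(maxlen=buffer_capacity)  # past textual feedback
    prompt = initial_prompt
    for _ in range(epochs):
        # 1. Run the system with the current prompt and collect feedback.
        fresh = [feedbacker(interact(prompt)) for _ in range(dialogues_per_epoch)]
        # 2. Assumption: replay draws uniformly from feedback of earlier epochs.
        replayed = rng.sample(list(replay), min(replay_size, len(replay)))
        # 3. The rewriter produces the next prompt; no model weights change.
        prompt = rewriter(prompt, fresh + replayed)
        replay.extend(fresh)
    return prompt


# Toy stand-ins for the environment and the two LLM roles (hypothetical).
def toy_interact(prompt: str) -> Trajectory:
    return [("I need a cheap hotel.", f"(reply under a {len(prompt)}-character prompt)")]


def toy_feedbacker(trajectory: Trajectory) -> str:
    return f"feedback on {len(trajectory)} turn(s)"


def toy_rewriter(prompt: str, feedback: List[str]) -> str:
    return f"{prompt} [revised with {len(feedback)} notes]"
```

In a real setting, `interact` would run a dialogue with a simulated or real user, and `feedbacker` and `rewriter` would each wrap an LLM call; only the prompt string is updated between epochs.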

Paper Structure

This paper contains 28 sections, 5 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: The structure of Reinforced Prompt Optimisation (RPO). The initial $\textit{prompt}^1$ can be generated by LLMs or written by experts. In interactive optimisation, the system will first interact with the environment, e.g., simulated or real users. The feedbacker, e.g., human experts or LLMs, will provide textual feedback based on trajectories. The rewriter generates a new prompt based on the original prompt and the textual feedback to update the system's original prompt. One cycle of interactive optimisation is called an epoch, and we use superscripts to denote the epoch number.
  • Figure 2: Workflow of feedback generation by an LLM. The Monte Carlo–style feedback (left) is generated after the entire interaction is completed, whereas the Temporal Difference–style feedback (right) consists of turn-level sub-feedback. Each sub-feedback includes a prediction of next-turn user satisfaction, a prediction of goal success, and an actionable suggestion.
  • Figure 3: Summary of our experimental tasks.
  • Figure 4: The training curves of different optimisation methods. Each setting is trained with 4 seeds and evaluated on 100 dialogues. The line shows the average success rate and the shaded region shows the standard error.
  • Figure 5: Overall preference between our method and a standard system (Standard), GPO, and HuatuoGPT-II (Huatuo) on the medical question-answering task. The overall recommendation by human experts is based on safety, professionalism, and fluency.
  • ...and 7 more figures
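
The two feedback styles in the Figure 2 caption differ mainly in when the feedbacker is called and in what it returns. The sketch below illustrates that difference under stated assumptions: the `SubFeedback` fields follow the caption (predicted next-turn user satisfaction, predicted goal success, an actionable suggestion), but their string types, the function names, and the choice to show the turn-level critic only the dialogue prefix up to the current turn are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user utterance, system response)


@dataclass
class SubFeedback:
    """Turn-level feedback item; field types are assumptions for illustration."""
    turn_index: int
    predicted_satisfaction: str   # prediction of next-turn user satisfaction
    predicted_goal_success: str   # prediction of eventual goal success
    suggestion: str               # actionable suggestion for the system


def mc_feedback(trajectory: List[Turn], critic: Callable[[List[Turn]], str]) -> str:
    # Monte Carlo style: one piece of feedback after the whole interaction.
    return critic(trajectory)


def td_feedback(
    trajectory: List[Turn],
    turn_critic: Callable[[List[Turn]], SubFeedback],
) -> List[SubFeedback]:
    # Temporal Difference style: one sub-feedback per turn.
    # Assumption: the critic sees only the turns up to and including turn i.
    return [turn_critic(trajectory[: i + 1]) for i in range(len(trajectory))]


# Toy critics standing in for LLM calls (hypothetical).
def toy_critic(trajectory: List[Turn]) -> str:
    return f"overall feedback on {len(trajectory)} turns"


def toy_turn_critic(prefix: List[Turn]) -> SubFeedback:
    i = len(prefix) - 1
    return SubFeedback(i, "unknown", "unknown", f"review the response at turn {i}")
```

For a three-turn dialogue, `mc_feedback` yields a single string while `td_feedback` yields three `SubFeedback` items, one per turn.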