Prompt reinforcing for long-term planning of large language models
Hsien-Chin Lin, Benjamin Matthias Ruppik, Carel van Niekerk, Chia-Hao Shen, Michael Heck, Nurul Lubis, Renato Vukovic, Shutong Feng, Milica Gašić
TL;DR
This work tackles the difficulty of long-horizon planning in multi-turn interactions by introducing Reinforced Prompt Optimisation (RPO), a reinforcement-learning-inspired, parameter-free framework that iteratively updates the system instruction prompt using turn-level textual feedback and experience replay. By separating feedback generation (TD-style or Monte Carlo) from prompt rewriting, RPO lets LLM-based systems improve strategic planning on tasks such as Text-to-SQL, task-oriented dialogue, and medical QA, while remaining compatible with both open-source and closed-source backbones. Empirical results across five LLMs show that TD-based feedback with replay yields substantial gains over baselines, though a gap remains relative to fully specified prompts, which underscores both the method’s practicality and the challenges that remain for unseen domains. Because it changes only the prompt and never the model parameters, the approach avoids the cost of parameter updates, and it generalises across backbones, opening avenues for reinforcement-learning-inspired prompt optimisation in real-time LLM applications.
Abstract
Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior work in dialogue systems has shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning by modifying only the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement on multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research into reinforcement-learning-inspired, parameter-free optimisation methods.
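
The loop described above (run a multi-turn interaction, collect turn-by-turn textual feedback, store it in a replay buffer, and rewrite the task instruction prompt from a sample of that buffer) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every name here (`run_episode`, `optimise_prompt`, `give_feedback`, `rewrite`, the toy stubs, the buffer and sample sizes) is hypothetical. In practice the agent, the feedback generator, and the rewriter would each be an LLM call, and the choice between TD-style and Monte Carlo feedback would live inside `give_feedback`, which is not modelled here.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Turn:
    user: str
    agent: str


# Hypothetical interfaces; in practice each would wrap an LLM call.
AgentFn = Callable[[str, List[Turn], str], str]   # (prompt, history, user message) -> reply
FeedbackFn = Callable[[List[Turn]], List[str]]    # dialogue -> one feedback string per turn
RewriteFn = Callable[[str, List[str]], str]       # (prompt, feedback batch) -> new prompt


def run_episode(prompt: str, agent: AgentFn, user_msgs: List[str]) -> List[Turn]:
    """Roll out one multi-turn interaction under the current prompt."""
    history: List[Turn] = []
    for msg in user_msgs:
        reply = agent(prompt, history, msg)
        history.append(Turn(user=msg, agent=reply))
    return history


def optimise_prompt(
    prompt: str,
    agent: AgentFn,
    give_feedback: FeedbackFn,
    rewrite: RewriteFn,
    episodes: List[List[str]],
    buffer_size: int = 20,   # assumed value, for illustration only
    sample_size: int = 4,    # assumed value, for illustration only
    seed: int = 0,
) -> str:
    """Iteratively rewrite the prompt; model parameters are never touched."""
    rng = random.Random(seed)
    replay: List[str] = []
    for user_msgs in episodes:
        dialogue = run_episode(prompt, agent, user_msgs)
        # Feedback generation is kept separate from prompt rewriting.
        replay.extend(give_feedback(dialogue))
        replay = replay[-buffer_size:]  # keep only the most recent feedback
        # Experience replay: rewrite from a sample of stored feedback.
        batch = rng.sample(replay, min(sample_size, len(replay)))
        prompt = rewrite(prompt, batch)
    return prompt


# Toy stand-ins so the sketch runs without any LLM.
def toy_agent(prompt: str, history: List[Turn], user_msg: str) -> str:
    return f"reply to: {user_msg}"


def toy_feedback(dialogue: List[Turn]) -> List[str]:
    return [f"turn {i}: confirm the user's goal before answering"
            for i, _ in enumerate(dialogue)]


def toy_rewrite(prompt: str, batch: List[str]) -> str:
    # Append each new piece of feedback once, as a bullet under the prompt.
    notes = sorted(set(batch))
    return prompt + "".join(f"\n- {n}" for n in notes if n not in prompt)


final_prompt = optimise_prompt(
    "You are a helpful assistant.",
    toy_agent,
    toy_feedback,
    toy_rewrite,
    episodes=[["hi", "book a table"], ["find a hotel"]],
)
print(final_prompt)
```

Passing the three components in as plain callables mirrors the separation between the task agent and the meta-prompting agents, so any of them could be swapped for a different backbone without changing the loop.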
