PRL: Prompts from Reinforcement Learning

Paweł Batorski, Adrian Kosmala, Paul Swoboda

TL;DR

PRL is a reinforcement-learning-based framework for automatic prompt generation that can create novel few-shot examples unseen during training and includes an explicit reasoning phase. The method pairs a trainable Prompt Generator with a frozen Evaluation Model and optimizes the generator with GRPO and a prompt selection strategy to maximize task performance across classification, summarization, and simplification. Across benchmarks, PRL achieves state-of-the-art results, with notably large gains on SUBJ and on the ROUGE and SARI metrics, and shows that better prompts remain beneficial even for very large models, albeit at higher compute cost. The findings highlight that few-shot prompting behavior emerges from RL optimization and offer a scalable route to automated, task-specific prompting.

Abstract

Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses APE by 2.58% and EvoPrompt by 1.00%. Additionally, it improves the average ROUGE scores on the summarization task by 4.32 over APE and by 2.12 over EvoPrompt, and the SARI score on simplification by 6.93 over APE and by 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl.

Paper Structure

This paper contains 20 sections, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: Left: Illustration of our RL-based prompt optimization cycle, showing the iterative process of prompt generation, evaluation, and refinement. Right: Comparison of prompt engineering methods, highlighting that PRL not only automates prompt generation and refinement but also uniquely incorporates novel task-specific few-shot examples, resulting in superior overall performance. A yellow tilde ($\sim\mkern-14.5mu\sim$) is added for APO to indicate that, although it produces few-shot examples, they are sourced from the training data, which significantly bounds performance, whereas PRL generates entirely new instances unseen during training.
  • Figure 2: Training scheme of PRL. First, the Prompt Generator $\pi_{\theta}^{\text{generator}}$ generates a set of outputs $o_1, \ldots, o_n$ (reasoning + generated prompt) from which the corresponding prompts $p_1, \ldots, p_n$ are extracted. Each prompt is then evaluated by the Evaluation Model $\pi^{\text{eval}}$ (a language model with frozen parameters), which produces corresponding answers. These answers, along with the outputs from the Prompt Generator, are used to compute rewards $r_1, \ldots, r_{n}$. Finally, the rewards are used to update the parameters of the Prompt Generator through RL.
  • Figure 3: Prompt used by our model. In the system prompt, we instruct the model to generate a reasoning trace enclosed within <think> and </think> tokens, followed by the final answer enclosed within <answer> and </answer> tokens. The user message provides the base prompt that the model should refine. The model’s objective is to produce a prompt that is better than the best prompt.
  • Figure 4: Comparison of a manual instruction, the best PRL prompt, and the best EvoPrompt prompt, along with their accuracies on the SST-2 task.
  • Figure 5: Comparison of averaged ROUGE metrics based on prompts generated by PRL, EvoPrompt, and Manual Instruction for the summarization task. This figure highlights the importance of precise prompt design: although the two prompts generated by EvoPrompt with two different seeds are superficially similar, they result in significantly different performance. In contrast, the PRL prompt is both more effective and better aligned with the task objective.
  • ...and 8 more figures
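
The loop described in the captions of Figures 2 and 3 (generate outputs, extract prompts, score them with a frozen evaluator, turn rewards into an update signal) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `evaluate` callback standing in for the frozen Evaluation Model, and the `format_penalty` value are all hypothetical. The advantage step follows the usual GRPO recipe of normalizing each reward by the mean and standard deviation of its group; the use of the population standard deviation and the `eps` term are assumptions made here for numerical convenience.

```python
import re
from statistics import mean, pstdev
from typing import Callable, List, Optional

# Matches the <think>...</think> <answer>...</answer> layout from Figure 3.
OUTPUT_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def extract_prompt(output: str) -> Optional[str]:
    """Return the text inside <answer> tags, or None if the layout is violated."""
    match = OUTPUT_PATTERN.search(output)
    if match is None:
        return None
    return match.group(2).strip()


def compute_rewards(
    outputs: List[str],
    evaluate: Callable[[str], float],
    format_penalty: float = 0.0,  # hypothetical reward for malformed outputs
) -> List[float]:
    """Score each generator output o_i and return the rewards r_i.

    `evaluate` is a stand-in for running the frozen Evaluation Model with
    the extracted prompt p_i and measuring task performance.
    """
    rewards = []
    for output in outputs:
        prompt = extract_prompt(output)
        rewards.append(format_penalty if prompt is None else evaluate(prompt))
    return rewards


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group, as in GRPO-style training."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    outputs = [
        "<think>Add an example.</think><answer>Classify the review.</answer>",
        "output without the required tags",
    ]
    # Toy evaluator: a real one would query the frozen model on held-out data.
    rewards = compute_rewards(outputs, evaluate=lambda prompt: 0.8)
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch a malformed output simply receives `format_penalty`, so it ends up with a below-average advantage whenever well-formed outputs in the same group score higher. How the paper actually combines format and task rewards is not stated in the captions above.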

Theorems & Definitions (2)

  • Remark 1
  • Remark 2