StablePrompt: Automatic Prompt Tuning using Reinforcement Learning for Large Language Models

Minchan Kwon, Gaeun Kim, Jongsuk Kim, Haeil Lee, Junmo Kim

TL;DR

StablePrompt tackles the instability of reinforcement-learning–based prompt tuning for large language models by introducing Adaptive Proximal Policy Optimization (APPO) with an anchor model. The method redefines prompt search as an online, on-policy RL problem where an agent LLM generates prompts and a target LLM yields rewards from its responses; APPO stabilizes updates by constraining toward an adaptive anchor rather than a fixed previous policy. It also offers Test-Time Editing StablePrompt (TTE-StablePrompt) to create input-dependent prompts. Empirical results across few-shot classification, induction, and QA demonstrate strong, sometimes state-of-the-art, performance across diverse agent–target model pairs, including models larger than 7B. The work shows RL-based prompt tuning can be both stable and scalable for practical use with large LLMs, with implications for cost-efficient prompting and broader applicability in real-world NLP tasks.

Abstract

Finding appropriate prompts for a specific task has become an important issue as the usage of Large Language Models (LLMs) has expanded. Reinforcement Learning (RL) is widely used for prompt tuning, but its inherent instability and environmental dependency make it difficult to use in practice. In this paper, we propose StablePrompt, which strikes a balance between training stability and search space, mitigating the instability of RL and producing high-performance prompts. We formulate prompt tuning as an online RL problem between the agent and target LLMs and introduce Adaptive Proximal Policy Optimization (APPO). APPO introduces an LLM anchor model to adaptively adjust the rate of policy updates. This allows for flexible prompt search while preserving the linguistic ability of the pre-trained LLM. StablePrompt outperforms previous methods on various tasks including text classification, question answering, and text generation. Our code is available on GitHub.
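The agent/target formulation above can be illustrated with a minimal sketch. Everything named here is hypothetical: `prompt_reward`, `toy_target`, and the use of plain accuracy as the reward are assumptions made for illustration, since the abstract states only that the target LLM's response to a generated prompt supplies the reward. A real setup would call an actual LLM where the toy function stands in.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in type: a "target model" maps (prompt, input text) to a predicted label.
TargetModel = Callable[[str, str], str]


def prompt_reward(prompt: str,
                  target: TargetModel,
                  examples: Sequence[Tuple[str, str]]) -> float:
    """Score a candidate prompt as the target's accuracy on labeled examples.

    Accuracy is an assumed reward; the source does not specify the exact form.
    """
    if not examples:
        raise ValueError("need at least one labeled example")
    correct = sum(1 for text, label in examples if target(prompt, text) == label)
    return correct / len(examples)


def toy_target(prompt: str, text: str) -> str:
    """Toy replacement for a target LLM, used only to make the sketch runnable."""
    # The toy model only "understands" the task when the prompt mentions sentiment.
    if "sentiment" not in prompt.lower():
        return "negative"
    return "positive" if "good" in text else "negative"


examples = [
    ("a good movie", "positive"),
    ("a dull movie", "negative"),
    ("good acting", "positive"),
    ("good fun", "positive"),
]

# In the paper's setting these candidates would be generated by the agent LLM.
candidates = ["Classify the sentiment of the review.", "Repeat the input."]

rewards = {p: prompt_reward(p, toy_target, examples) for p in candidates}
best_prompt = max(rewards, key=rewards.get)
```

In an RL loop, each reward would be fed back to update the agent that proposed the prompt; this sketch only scores a fixed candidate set and picks the best one.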

Paper Structure

This paper contains 80 sections, 14 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: Overview of StablePrompt. We formulate prompt tuning as an RL framework using LLMs. We use the target LLM and the given dataset as the world model, and the agent LLM as the policy. We use the response of the target LLM to the prompt generated by the agent LLM as the reward.
  • Figure 2: Training framework of StablePrompt. Prompts are generated using the task-agnostic meta-prompt, and the reward of each generated prompt is calculated on the training data.
  • Figure 3: Illustration comparing APPO to the original PPO. The circle represents the constraint of KL-divergence, and each dot represents the parameter of the agent model at each time step. APPO is robust to incorrect rewards because it maintains an anchor model, while PPO deviates from the optimal prompt as incorrect rewards accumulate.
  • Figure 4: Heatmap of few-shot text classification tasks on diverse target-agent pairs. Reported numbers are an average of 6 datasets. MP: Manual prompt, G2: Gemma-2B, G7: Gemma-7B, M7: Mistral-7B, L8: Llama-3-8B, F11: Falcon-11B. StablePrompt works well with a variety of LLMs.
  • Figure 5: Generated prompts and input in machine learning subset of MMLU dataset. We truncate the latter part of the generated prompt for readability. The full prompt can be found in the appendix.
  • ...and 2 more figures
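
The Figure 3 caption describes a KL-divergence constraint measured against an anchor model. As a rough illustration of that idea only, the sketch below subtracts a KL penalty from a task reward. It is not the APPO update itself: the penalty form, the `beta` weight, and the function names are all assumptions, and the rule by which the anchor adapts is not described in this section, so it is left out.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same support.

    Assumes every entry of q is strictly positive.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def penalized_reward(task_reward: float,
                     agent_probs: Sequence[float],
                     anchor_probs: Sequence[float],
                     beta: float = 0.1) -> float:
    """Task reward minus a KL penalty toward the anchor (beta is a hypothetical weight)."""
    return task_reward - beta * kl_divergence(agent_probs, anchor_probs)


# Toy next-token distributions over a three-token vocabulary.
anchor = [0.5, 0.3, 0.2]
near = [0.45, 0.35, 0.2]   # agent that stays close to the anchor
far = [0.05, 0.05, 0.9]    # agent that has drifted far from the anchor

near_score = penalized_reward(1.0, near, anchor)
far_score = penalized_reward(1.0, far, anchor)
```

With equal task rewards, the agent that stays closer to the anchor keeps more of its reward, which is the general intuition behind constraining updates toward a reference model.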