Evolutionary System Prompt Learning can Facilitate Reinforcement Learning for LLMs

Lunjun Zhang, Ryan Chen, Bradly C. Stadie

TL;DR

Results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization, resulting in improved performance across reasoning and agentic tasks.

Abstract

Building agentic systems that can autonomously self-improve from experience is a longstanding goal of AI. Large language models (LLMs) today primarily self-improve via two mechanisms: self-reflection for context updates, and reinforcement learning (RL) for weight updates. In this work, we propose Evolutionary System Prompt Learning (E-SPL), a method for jointly improving model contexts and model weights. In each RL iteration, E-SPL selects multiple system prompts and runs rollouts with each in parallel. It applies RL updates to model weights conditioned on each system prompt, and evolutionary updates to the system prompt population via LLM-driven mutation and crossover. Each system prompt has a TrueSkill rating for evolutionary selection, updated from relative performance within each RL iteration batch. E-SPL encourages a natural division between declarative knowledge encoded in prompts and procedural knowledge encoded in weights, resulting in improved performance across reasoning and agentic tasks. For instance, in an easy-to-hard (AIME $\rightarrow$ BeyondAIME) generalization setting, E-SPL improves RL success rate from 38.8% to 45.1% while also outperforming reflective prompt evolution (40.0%). Overall, our results show that coupling reinforcement learning with system prompt evolution yields consistent gains in sample efficiency and generalization. Code: https://github.com/LunjunZhang/E-SPL
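
The loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `PromptEntry`, `elo_update`, `espl_iteration`, `rollout_fn`, and `mutate_fn` are hypothetical, uniform random selection and rating inheritance for the child are assumptions, the RL weight update is left as a comment, and a simple Elo-style pairwise update stands in for the TrueSkill rating used in the paper.

```python
import random
from dataclasses import dataclass
from itertools import combinations


@dataclass
class PromptEntry:
    text: str
    rating: float = 1000.0  # Elo-style score; a simplified stand-in for TrueSkill


def elo_update(winner, loser, k=16.0):
    """Move rating from loser to winner, scaled by how surprising the win was."""
    expected = 1.0 / (1.0 + 10 ** ((loser.rating - winner.rating) / 400.0))
    delta = k * (1.0 - expected)
    winner.rating += delta
    loser.rating -= delta


def espl_iteration(population, rollout_fn, mutate_fn, num_selected=2, rng=random):
    """One illustrative iteration: select, roll out, re-rate, mutate."""
    # 1. Select several system prompts (assumption: uniform random sampling).
    selected = rng.sample(population, k=min(num_selected, len(population)))

    # 2. Run rollouts under each prompt; rollout_fn returns a mean reward.
    scores = {id(p): rollout_fn(p.text) for p in selected}
    # The RL weight update conditioned on each prompt would happen here.

    # 3. Update ratings from relative performance within this batch.
    for a, b in combinations(selected, 2):
        if scores[id(a)] > scores[id(b)]:
            elo_update(a, b)
        elif scores[id(b)] > scores[id(a)]:
            elo_update(b, a)

    # 4. Mutate the best prompt of the batch; the child joins the population
    #    (assumption: the child inherits the parent's rating).
    best = max(selected, key=lambda p: scores[id(p)])
    child = PromptEntry(text=mutate_fn(best.text), rating=best.rating)
    population.append(child)
    return best, child
```

In a real system, `rollout_fn` would run the policy on a batch of tasks and `mutate_fn` would call an LLM to reflect on the rollouts and edit the prompt; here both are plain callables so that the control flow can be read and run in isolation.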

Paper Structure

This paper contains 66 sections, 58 equations, 22 figures, 1 table, 2 algorithms.

Figures (22)

  • Figure 1: Evolving System Prompts helps RL generalize better.
  • Figure 2: Evolutionary System Prompt Learning (E-SPL) jointly optimizes model contexts and model weights to enhance LLM self-improvement. Evolution updates contexts; RL updates weights. The learned system prompts can encode declarative knowledge via articulated principles and strategies, while RL gradient updates can focus on honing the model's procedural and implicit knowledge.
  • Figure 3: Evolutionary trees of E-SPL. During RL, E-SPL creates an evolutionary tree of system prompts by reusing the data already generated by RL, with minimal additional computational overhead. Each genetic operator (mutation or crossover) only requires a sampling server for self-reflection with different context construction strategies, which can run concurrently with RL gradient updates.
  • Figure 4: Discovered Strategies in learned System Prompts for solving math problems. These explicit behavior specifications include useful heuristics and tips for various categories of problems, self-verification strategies such as checking for consistency and plausibility, a list of common failure modes, etc. Note that RL is done under diverse system prompts, and does not overfit to any particular one.
  • Figure 5: LLM-based Mutation operator in E-SPL. The highest-performing prompt in each iteration undergoes self-reflection on group-wise rollout outcomes. An LLM-generated diff edits the parent into a child system prompt, removing ineffective rules and converting observed mistakes into improved declarative instructions, yielding a refined prompt that re-enters the evolutionary population.
  • ...and 17 more figures