Evolutionary Discovery of Reinforcement Learning Algorithms via Large Language Models

Alkis Sygkounas, Amy Loutfi, Andreas Persson

Abstract

Reinforcement learning algorithms are defined by their learning update rules, which are typically hand-designed and fixed. We present an evolutionary framework for discovering reinforcement learning algorithms by searching directly over executable update rules that implement complete training procedures. The approach builds on REvolve, an evolutionary system that uses large language models as generative variation operators, and extends it from reward-function discovery to algorithm discovery. To promote the emergence of nonstandard learning rules, the search excludes canonical mechanisms such as actor--critic structures, temporal-difference losses, and value bootstrapping. Because reinforcement learning algorithms are highly sensitive to internal scalar parameters, we introduce a post-evolution refinement stage in which a large language model proposes feasible hyperparameter ranges for each evolved update rule. Evaluated end-to-end by full training runs on multiple Gymnasium benchmarks, the discovered algorithms achieve competitive performance relative to established baselines, including SAC, PPO, DQN, and A2C.
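
The abstract summarizes the full pipeline; the following minimal Python sketch illustrates the kind of loop it describes, with the LLM acting as the generative variation operator and fitness measured by complete training runs. All names here (Candidate, propose_variant, train_and_evaluate, the selection scheme) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an LLM-driven evolutionary search over executable update rules.
# All names and the selection scheme are hypothetical placeholders.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List
import random

@dataclass
class Candidate:
    code: str                        # source of an executable update rule
    fitness: float = float("-inf")   # mean return across full training runs

def evolve_update_rules(
    propose_variant: Callable[[List[str]], str],      # LLM as the variation operator
    train_and_evaluate: Callable[[str, str], float],  # one full training run -> score
    env_ids: List[str],
    population: List[Candidate],
    n_generations: int = 10,
    n_parents: int = 2,
) -> Candidate:
    """Evolve executable RL update rules with an LLM proposing code-level variants."""
    for _ in range(n_generations):
        # The LLM writes new candidate update rules conditioned on sampled parents.
        parents = random.sample(population, n_parents)
        offspring = [Candidate(propose_variant([p.code for p in parents]))
                     for _ in range(len(population))]
        # Fitness is measured end-to-end: train with the candidate's rule on each
        # benchmark environment and average the resulting evaluation returns.
        for cand in offspring:
            cand.fitness = mean(train_and_evaluate(cand.code, env) for env in env_ids)
        # Elitist selection: keep the best individuals from parents and offspring.
        population = sorted(population + offspring,
                            key=lambda c: c.fitness, reverse=True)[:len(population)]
    return max(population, key=lambda c: c.fitness)
```

The post-evolution refinement stage described in the abstract would then query the LLM for feasible ranges of the scalar hyperparameters inside the best rule's code and search within them; that step is omitted from this sketch.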

Figures (5)

  • Figure 1: Illustrative overview of the proposed method. A population of candidate algorithms is iteratively evolved. In each generation (solid arrows), a large language model proposes coherent variants ($\mathcal{L}_{f_1}, \mathcal{L}_{f_2}, \ldots, \mathcal{L}_{f_n}$), which are evaluated via training in Gymnasium environments to obtain fitness scores ($\bar{F}^{(g)}_k$) that are subsequently used for selection and population updates. Post-evolution (dashed arrows), LLM-guided optimization selects a hyperparameter setting ($\beta_i^\star$) for the best evolved update rule ($\mathcal{L}^\star$), which then proceeds to final evaluation.
  • Figure 2: Representative Gymnasium environments used for training (top row) and evaluation (bottom row).
  • Figure 3: Evolution of maximum population fitness across generations for GPT-5.2 and Claude 4.5 Opus. Curves show the mean across two evolutionary seeds, with shaded regions indicating the standard deviation across seeds.
  • Figure 4: Seed-averaged evaluation learning curves for the two evolved algorithms across ten environments. Top: CG-FPD. Bottom: DF-CWP-CP. Curves show evaluation return smoothed with a moving average; shaded regions indicate one standard deviation across seeds. CartPole, InvertedPendulum, and MountainCar are trained for 500k steps, while all other environments are trained for 1M steps. Peak evaluation returns observed at the per-seed level are attenuated in the seed-averaged smoothed curves.
  • Figure 5: Ablation of the Levenshtein regularization weight $\alpha$ in the evolutionary objective (Eq. \ref{eq:levenstein}). Curves show best population fitness per generation for $\alpha=0$ and $\alpha=1$, averaged across evolutionary seeds.
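
Figure 5 refers to a Levenshtein-regularized evolutionary objective with weight $\alpha$; the equation itself is not reproduced in this excerpt. The sketch below shows one plausible form, in which a candidate's raw fitness is augmented by an $\alpha$-weighted, length-normalized Levenshtein distance to the rest of the population's code. The combination rule and normalization are assumptions, not the paper's exact objective.

```python
# Hypothetical sketch of a Levenshtein-regularized selection score: raw fitness
# plus an alpha-weighted code-diversity bonus. The paper's objective may differ.
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def regularized_score(fitness: float, code: str,
                      population_codes: List[str], alpha: float) -> float:
    """Raw fitness plus an alpha-weighted mean normalized edit distance to peers."""
    if not population_codes:
        return fitness
    diversity = sum(
        levenshtein(code, other) / max(len(code), len(other), 1)
        for other in population_codes
    ) / len(population_codes)
    return fitness + alpha * diversity
```

With $\alpha=0$ this reduces to selection on raw fitness alone, matching the ablation contrast shown in Figure 5.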