Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning

Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Yinggan Xu, Roberto Dailey, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen

TL;DR

This work demonstrates that Evolution Strategies (ES) can scale to full-parameter fine-tuning of billion-parameter LLMs without dimensionality reduction, offering a gradient-free alternative to reinforcement learning (RL). Using a memory-efficient, highly parallelizable ES variant with population size $N=30$, noise scale $\sigma=0.001$, and learning rate $\alpha=5\times10^{-4}$, the authors show ES can outperform state-of-the-art RL methods on long-horizon reasoning tasks (e.g., Countdown), while remaining robust across multiple base LLMs and reducing reward hacking. ES also achieves competitive results on math-reasoning benchmarks and solves hard puzzles like ARC-AGI and Sudoku, illustrating broad generalization beyond standard RL tasks. The findings broaden the algorithmic design space for LLM post-training, enabling scalable, robust, and accessible fine-tuning without backpropagation, with significant implications for alignment, safety, and practical deployment.
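
To make the roles of $N$, $\sigma$, and $\alpha$ concrete, the following is a minimal sketch of one ES update on a toy parameter vector. It is not the authors' implementation. The z-score reward normalization, the seed-based noise regeneration (used here to stand in for the "memory-efficient" aspect), and the update rule $\theta \leftarrow \theta + \alpha \cdot \frac{1}{N}\sum_i z_i \epsilon_i$ are assumptions made for illustration, and `es_step` and `toy_reward` are hypothetical names. A real run would perturb LLM weights and score generated responses instead.

```python
import random


def toy_reward(theta):
    """Hypothetical stand-in for a task reward: highest when every entry is 1."""
    return -sum((t - 1.0) ** 2 for t in theta)


def es_step(theta, reward_fn, rng, n=30, sigma=0.001, alpha=5e-4):
    """One simplified ES update (illustrative assumption, not the paper's code)."""
    # Keep one seed per population member instead of storing full noise vectors.
    seeds = [rng.randrange(2**32) for _ in range(n)]

    # Evaluate each perturbed copy theta + sigma * eps_i.
    rewards = []
    for seed in seeds:
        noise = random.Random(seed)
        perturbed = [t + sigma * noise.gauss(0.0, 1.0) for t in theta]
        rewards.append(reward_fn(perturbed))

    # Z-score normalization makes the step size independent of reward scale.
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        return list(theta)  # all members scored the same: no search signal
    z = [(r - mean) / std for r in rewards]

    # Regenerate each noise vector from its seed and accumulate the update.
    new_theta = list(theta)
    for seed, zi in zip(seeds, z):
        noise = random.Random(seed)
        for j in range(len(new_theta)):
            new_theta[j] += alpha * zi * noise.gauss(0.0, 1.0) / n
    return new_theta


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.0, 0.0, 0.0]
    print("reward before:", toy_reward(theta))
    for _ in range(50):
        theta = es_step(theta, toy_reward, rng)
    print("reward after: ", toy_reward(theta))
```

Only forward evaluations of `reward_fn` are needed, so no gradients or optimizer state are kept, and the population members can be evaluated independently of one another.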

Abstract

Fine-tuning large language models (LLMs) for downstream tasks is an essential stage of modern AI deployment. Reinforcement learning (RL) has emerged as the dominant fine-tuning paradigm, underpinning many state-of-the-art LLMs. In contrast, evolution strategies (ES) have largely been overlooked due to the widespread belief that they do not scale to modern model sizes. This paper overturns this assumption by demonstrating the first successful application of ES to full-parameter fine-tuning of LLMs at the billion-parameter scale, without dimensionality reduction. ES can indeed search over extremely high-dimensional parameter spaces and outperform established RL implementations across multiple axes, including improved tolerance to long-horizon and delayed rewards, robustness across diverse base LLMs, reduced susceptibility to reward hacking, and improved training stability. These findings suggest that ES is not merely a viable alternative to RL, but a fundamentally different and powerful backpropagation-free post-training paradigm that opens a new direction for LLM fine-tuning beyond current RL-based approaches. The source code is available at: https://github.com/VsonicV/es-fine-tuning-paper.

Paper Structure

This paper contains 33 sections, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Mean conciseness reward and mean KL divergence from the base model for each fine-tuning checkpoint across different learning parameters. The ES Pareto front (blue line) lies above and to the left of the GRPO Pareto front (black line), indicating that ES found better tradeoffs. ES discovers these solutions without any KL divergence penalty, suggesting that it represents a distinctly different fine-tuning mechanism from GRPO.
  • Figure 2: Performance of ES compared to strong, well-established RL baselines across math reasoning benchmarks. Across all benchmarks, ES achieved performance competitive with OpenReasoner-Zero-7B (PPO), Simple-RL-Zero (GRPO), and Oat-Zero-7B (Dr.GRPO). Given the vanilla nature of the current ES implementation, these results constitute a promising starting point for ES fine-tuning in math.
  • Figure 3: An example of a Countdown task interaction.
  • Figure 4: GRPO behavior with different learning rates. (a) GRPO models trained using different learning rates and $\beta$ values. Both conciseness reward and KL divergence increase with higher learning rates. (b) The ES Pareto front (blue line, top) plotted with the GRPO Pareto front (black line, bottom) over different model learning parameters. ES dominates GRPO across the whole range.
  • Figure 5: Reward distributions in fine-tuning for conciseness with different learning rates $\alpha = \{2\times10^{-6}, 3\times10^{-6}, 4\times10^{-6}, 5\times10^{-6}\}$ and $\beta = \{0.0, 0.01, 0.1, 1.0\}$ compared to ES on the Qwen2.5-7B-Instruct base model. Whereas the GRPO reward distribution is similar to that of the base model, ES shifts it to the right, i.e., toward higher rewards. With GRPO, higher rewards can only be achieved with high learning rates and low $\beta$, a setting that often breaks the model's performance.
  • ...and 7 more figures
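
The Pareto-front comparison in Figures 1 and 4 can be sketched as follows. This is a minimal illustration, assuming each checkpoint is summarized as a (KL divergence, reward) pair where lower KL and higher reward are both better. The function name `pareto_front` and the numbers in the usage example are hypothetical and are not taken from the paper.

```python
def pareto_front(points):
    """Return the (kl, reward) pairs that no other pair dominates.

    A pair is dominated when another pair has lower-or-equal KL divergence
    and strictly higher reward (or equal reward at strictly lower KL).
    """
    front = []
    best_reward = float("-inf")
    # Scan by increasing KL; at equal KL, visit the highest reward first.
    for kl, reward in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it beats every point with smaller-or-equal KL.
        if reward > best_reward:
            front.append((kl, reward))
            best_reward = reward
    return front


if __name__ == "__main__":
    # Hypothetical checkpoints, for illustration only.
    checkpoints = [(0.10, 0.50), (0.20, 0.40), (0.30, 0.90), (0.05, 0.20)]
    print(pareto_front(checkpoints))
```

One front lying "above and to the left" of another means that, at any given KL divergence, it reaches a higher reward, which is the sense in which the captions describe ES as finding better tradeoffs than GRPO.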