Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training

William Hoy, Binxu Wang, Xu Pan

Abstract

Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement-learning-based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough progress on the task to match gradient-based RL in downstream accuracy. These results show that gradient-free and gradient-based fine-tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: https://github.com/Bhoy1/ESvsGRPO.
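
The update-geometry findings can be checked directly from checkpoints. The sketch below is illustrative rather than the released code: the checkpoint filenames are placeholders, and any pair of state dicts sharing the base model's keys will do. It flattens the ES and GRPO weight deltas relative to the base model and reports their norms and cosine similarity; a much larger ES norm together with a cosine near zero corresponds to the "larger updates, nearly orthogonal directions" observation.

```python
import torch

def flat_delta(base_sd, tuned_sd):
    """Concatenate per-parameter differences (tuned - base) into one flat vector."""
    return torch.cat([(tuned_sd[k] - base_sd[k]).flatten().float()
                      for k in sorted(base_sd)])

# Placeholder checkpoint paths; any state_dicts with matching keys work.
base_sd = torch.load("base_model.pt", map_location="cpu")
es_sd   = torch.load("es_checkpoint.pt", map_location="cpu")
grpo_sd = torch.load("grpo_checkpoint.pt", map_location="cpu")

delta_es, delta_grpo = flat_delta(base_sd, es_sd), flat_delta(base_sd, grpo_sd)

print(f"||ES update||   = {delta_es.norm().item():.2f}")
print(f"||GRPO update|| = {delta_grpo.norm().item():.2f}")
cos = torch.nn.functional.cosine_similarity(delta_es, delta_grpo, dim=0).item()
print(f"cos(ES, GRPO)   = {cos:.4f}  # near 0 => nearly orthogonal update directions")
```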

Paper Structure

This paper contains 73 sections, 8 theorems, 169 equations, 11 figures, 7 tables.

Key Result

Proposition 1

Let $R(\theta) = \text{const}$, so all reward variation arises from observation noise. Under the ES update $\Delta\theta = \frac{\alpha}{N}\sum_{i=1}^N Z_i \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, I_d)$ and z-scored rewards $Z_i$, the weight update is a pure isotropic random walk: after $T$ steps of ES with population size $N$, step size $\alpha$, and parameter dimension $d$, the cumulative displacement satisfies $\mathbb{E}\,\|\theta_T - \theta_0\|^2 = \frac{T\alpha^2 d}{N}$.
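
This flat-landscape behavior is easy to verify numerically. The following is a minimal Monte-Carlo sketch (an illustration, not the paper's experiments): it runs the ES update above with pure-noise rewards and compares the average squared displacement after $T$ steps against $T\alpha^2 d / N$.

```python
import numpy as np

def es_flat_walk(T=200, N=32, alpha=0.01, d=1000, seed=0):
    """ES on a constant-reward landscape: rewards are pure noise, so the
    z-scored weights Z_i are independent of the perturbations eps_i."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    for _ in range(T):
        eps = rng.standard_normal((N, d))   # perturbations eps_i ~ N(0, I_d)
        r = rng.standard_normal(N)          # rewards = observation noise only
        z = (r - r.mean()) / r.std()        # z-scored rewards
        theta += (alpha / N) * (z @ eps)    # ES update
    return np.sum(theta ** 2)               # squared displacement ||theta_T - theta_0||^2

T, N, alpha, d = 200, 32, 0.01, 1000
mean_sq = np.mean([es_flat_walk(T, N, alpha, d, seed=s) for s in range(20)])
print(f"simulated E||theta_T - theta_0||^2 = {mean_sq:.3f}")
print(f"predicted T*alpha^2*d/N            = {T * alpha**2 * d / N:.3f}")
```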

Figures (11)

  • Figure 1: Accuracy (%) on each task throughout sequential training. The shaded region marks the stage at which that task was trained. ES (300) shows clear degradation on earlier tasks as training progresses, while ES (100) and GRPO remain comparatively stable.
  • Figure 2: Incremental KL divergence per training step. Each cell shows $\text{KL}(\pi_{\text{after}} \| \pi_{\text{before}})$ for a given evaluation task (row) after a given training step (column). Both panels share the same color scale. ES exhibits broader off-diagonal drift while GRPO's changes are more localized to the diagonal.
  • Figure 3: Per-task accuracy along the linear interpolation path between the final ES and GRPO checkpoints. The dashed gray line shows the linear interpolation between endpoint accuracies. Across all tasks, the interpolated model remains close to or above the linear baseline with no catastrophic accuracy drop, indicating that the two solutions lie in the same basin of the loss landscape (a minimal interpolation sketch appears after this figure list).
  • Figure 4: Task accuracy when moving from the base model along ES (blue), GRPO (red), and random (gray) directions at increasing magnitudes. Dashed vertical lines mark $\|\Delta_{\text{GRPO}}\|$ (red) and $\|\Delta_{\text{ES}}\|$ (blue). Dotted horizontal lines show base and checkpoint accuracies.
  • Figure 5: Holdout task accuracy (IFEval, MMLU) when perturbing along ES (blue) and GRPO (red) directions from the final BoolQ checkpoint.
  • ...and 6 more figures
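
The interpolation underlying Figure 3 is a plain convex combination of the two final checkpoints, $\theta(\lambda) = (1-\lambda)\,\theta_{\text{ES}} + \lambda\,\theta_{\text{GRPO}}$. A minimal sketch is below; the checkpoint paths are placeholders and `evaluate_task` stands in for whatever per-task accuracy evaluation is used, so this is an outline rather than the released evaluation code.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, lam):
    """Element-wise convex combination (1 - lam) * sd_a + lam * sd_b."""
    return {k: (1.0 - lam) * sd_a[k].float() + lam * sd_b[k].float() for k in sd_a}

es_sd   = torch.load("es_checkpoint.pt", map_location="cpu")    # placeholder path
grpo_sd = torch.load("grpo_checkpoint.pt", map_location="cpu")  # placeholder path

for lam in [i / 10 for i in range(11)]:
    sd = interpolate_state_dicts(es_sd, grpo_sd, lam)
    # model.load_state_dict(sd)             # load into the evaluation model
    # acc = evaluate_task(model, task)      # hypothetical accuracy function
    # print(f"lambda={lam:.1f}  accuracy={acc:.3f}")
```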

Theorems & Definitions (22)

  • Remark 1
  • Proposition 1: Evolution Strategy on a Flat Landscape
  • Proposition 2: Evolution Strategy on a Linear Landscape
  • Remark 2: On-manifold alignment of ES update
  • Proposition 3: Evolution Strategy on a Quadratic Landscape
  • Remark 3: On-manifold alignment on quadratic landscape
  • Proposition 4: Dynamics of ES on Quadratic Landscape
  • Remark 4: Stability conditions
  • Remark 5: Optimal curvature
  • Remark 6: Role of noise scale
  • ...and 12 more