ESSA: Evolutionary Strategies for Scalable Alignment

Daria Korotyshova, Boris Shaposhnikov, Alexey Malakhov, Alexey Khokhulin, Nikita Surnachev, Kirill Ovcharenko, George Bredis, Alexey Gorbatovski, Viacheslav Sinii, Daniil Gavrilov

TL;DR

ESSA introduces a gradient-free approach to LLM alignment that restricts optimization to the singular values of SVD-decomposed LoRA adapters, after a supervised fine-tuning (SFT) warm start. By running CMA-ES on a compact, low-rank subspace and enabling inference-only, quantized operation, it achieves competitive or superior alignment quality compared to gradient-based GRPO, while significantly reducing training complexity and wall-clock time. The method scales well across model sizes and hardware, with strong robustness to hyperparameters and favorable parallelization, making it a practical alternative for large-scale alignment. The combination of SVD-LoRA parameterization, forward-only evaluation, and low communication overhead demonstrates a compelling route to scalable, hardware-friendly LLM alignment, albeit with caveats related to SFT dependence and fixed-rank limitations.

Abstract

Alignment of Large Language Models (LLMs) typically relies on Reinforcement Learning from Human Feedback (RLHF) with gradient-based optimizers such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). While effective, these methods require complex distributed training, large memory budgets, and careful hyperparameter tuning, all of which become increasingly difficult at billion-parameter scale. We present ESSA, Evolutionary Strategies for Scalable Alignment, a gradient-free framework that aligns LLMs using only forward inference and black-box optimization. ESSA focuses optimization on Low-Rank Adapters (LoRA) and further compresses their parameter space by optimizing only the singular values from a singular value decomposition (SVD) of each adapter matrix. This dimensionality reduction makes evolutionary search practical even for very large models and allows efficient operation in quantized INT4 and INT8 inference modes. Across benchmarks, ESSA improves the test accuracy of Qwen2.5-Math-7B by 12.6% on GSM8K and 14.8% on PRM800K, and raises the accuracy of LLaMA3.1-8B on IFEval by 22.5%, all compared with GRPO. In large-scale settings ESSA shows stronger scaling than gradient-based methods: on Qwen2.5-32B for PRM800K it reaches near-optimal accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs compared with GRPO. These results position evolutionary strategies as a compelling, hardware-friendly alternative to gradient-based LLM alignment, combining competitive quality with substantially reduced wall-clock time and engineering overhead.
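To make the parameterization concrete, the following is a minimal sketch and not the authors' implementation. It decomposes the two factors of a toy LoRA update with NumPy's SVD, freezes the resulting bases, and searches only over perturbations of the singular values using forward evaluations of a black-box reward. A simple elitist evolution strategy stands in for CMA-ES here, and the toy reward, the matrix sizes, and every function and variable name are assumptions made for illustration.

```python
import numpy as np

def svd_factor(M):
    """Return (U, s, Vt) with M = U @ diag(s) @ Vt; U and Vt stay frozen."""
    return np.linalg.svd(M, full_matrices=False)

def build_delta(parts_B, parts_A, sigma):
    """Rebuild the update B @ A after adding sigma to the singular values."""
    (Ub, sb, Vbt), (Ua, sa, Vat) = parts_B, parts_A
    r = sb.size
    B = Ub @ np.diag(sb + sigma[:r]) @ Vbt   # first r entries perturb B
    A = Ua @ np.diag(sa + sigma[r:]) @ Vat   # remaining entries perturb A
    return B @ A

def evolve(reward, dim, pop_size=16, elite=4, step=0.3, iters=60, seed=0):
    """Simplified elitist ES (a stand-in for CMA-ES): forward evaluations only."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    best_sigma, best_reward = mean.copy(), reward(mean)
    for _ in range(iters):
        pop = mean + step * rng.standard_normal((pop_size, dim))
        scores = np.array([reward(c) for c in pop])
        order = np.argsort(scores)[::-1]          # best candidates first
        mean = pop[order[:elite]].mean(axis=0)    # recombine the elites
        if scores[order[0]] > best_reward:
            best_sigma, best_reward = pop[order[0]].copy(), scores[order[0]]
    return best_sigma, best_reward

# Toy setup (all sizes hypothetical): frozen weight W0 plus a rank-2 adapter.
rng = np.random.default_rng(1)
d_out, d_in, rank = 6, 5, 2
W0 = rng.standard_normal((d_out, d_in))
B0 = rng.standard_normal((d_out, rank))   # plays the role of SFT-initialized factors
A0 = rng.standard_normal((rank, d_in))
parts_B, parts_A = svd_factor(B0), svd_factor(A0)

# Black-box reward: negative squared error against targets produced by a
# hidden perturbation of the singular values (an assumed toy objective).
X = rng.standard_normal((32, d_in))
sigma_true = np.array([0.5, -0.3, 0.2, 0.4])
Y = X @ (W0 + build_delta(parts_B, parts_A, sigma_true)).T

def reward(sigma):
    pred = X @ (W0 + build_delta(parts_B, parts_A, sigma)).T
    return -float(np.mean((pred - Y) ** 2))

baseline = reward(np.zeros(2 * rank))
best_sigma, best_reward = evolve(reward, dim=2 * rank)
print(f"reward before search: {baseline:.4f}, after search: {best_reward:.4f}")
```

The search vector has only `2 * rank` entries per adapted matrix regardless of the matrix dimensions, which is the dimensionality reduction the abstract refers to.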

Paper Structure

This paper contains 53 sections, 2 theorems, 10 equations, 34 figures, 8 tables, 1 algorithm.

Key Result

Lemma B.1

The minimum of $T^{\mathrm{grad}}_{\mathrm{fb\hbox{-}gen}}(\theta)$ over $\theta\in(0,1)$ is attained at

Figures (34)

  • Figure 1: Illustration of the ESSA framework. LoRA adapters are first initialized via SFT and decomposed into fixed SVD bases with trainable singular values. The term device N denotes a GPU worker in distributed evaluation. CMA-ES receives a seed at each device, generates a population of size N+1 locally, evaluates a different candidate, and returns a reward. Each candidate is a vector $\sigma_i \in \mathbb{R}^{\textbf{solution\_length}}, i = 0, 1, \dots, \text{N}$ and is added to the singular-value vectors of the trainable matrices. It is partitioned into contiguous slices, each of which corresponds to one LoRA matrix (e.g. $W_{Q}, W_K, W_V, W_O$ for each transformer layer) and contains $2 \cdot \textbf{LoRA\_rank}$ singular values (for factors $A$ and $B$). The solution length is the dimensionality of the candidate vector, i.e., the concatenation of all perturbations of the trainable LoRA singular values across all matrices and layers. Given the number of layers (num_layers), the number of matrices per layer (num_matrices_per_layer), and LoRA_rank: $\textbf{num\_matrices} = \textbf{num\_layers} \cdot \textbf{num\_matrices\_per\_layer}$ and $\textbf{solution\_length}= \textbf{num\_matrices} \cdot \textbf{LoRA\_rank} \cdot 2$.
  • Figure 2: Hyperparameter sensitivity of ESSA on Qwen2.5-Math-7B for GSM8K. Batch size 100. (a) Accuracy when varying LoRA rank and population size. (b) For each LoRA rank, the population size is fixed to the best value found in (a), while the percentage $\alpha$ of trainable singular values is varied. This illustrates how ESSA performance depends jointly on adapter rank and the fraction of singular values optimized. The single white cell occurs because for LoRA rank 8 and $\alpha\!=\!0.1$, rounding down yields zero trainable singular values, so no valid accuracy is reported.
  • Figure 3: GRPO and ESSA scaling on PRM800K with Qwen2.5-32B: time to reach $0.835$ accuracy vs. GPU count. ESSA: LoRA rank 16, pop. 128, batch size 256, $\alpha\!=\!1.0$. GRPO: LoRA rank 16, lr $1\!\times\!10^{-5}$, global batch 512, mini batch 64.
  • Figure 4: Validation accuracy over time on GSM8K with Qwen2.5-Math-7B. Panels (a)-(e) correspond to LoRA ranks 32, 16, 8, 4, and 2, respectively. ESSA (blue): batch size 100. GRPO (red): lr $1\!\times\!10^{-5}$, global batch 512, mini batch 64. ESSA rises rapidly and plateaus early across all ranks, while GRPO improves more gradually.
  • Figure 5: Validation accuracy over time on PRM800K. Qwen2.5-32B with LoRA rank 8 for both methods (a) and Qwen2.5-72B with LoRA rank 4 for both methods (b). For Qwen2.5-72B we run ESSA under BFLOAT16 with tensor parallelism ($TP$): $TP=2$ and $TP=4$, and under INT4 with $TP=1$, keeping the total GPU budget at 32 for both methods. ESSA (blue): batch size 256. GRPO (red): lr $1\!\times\!10^{-5}$, global batch 512, mini batch 64. Across both scales, ESSA reaches strong validation accuracy earlier and matches or exceeds GRPO throughout.
  • ...and 29 more figures
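
The arithmetic in the captions of Figures 1 and 2 can be sketched as follows. The solution-length formula is taken from the Figure 1 caption. The rule `floor(alpha * lora_rank)` for the number of trainable singular values is an assumption inferred from the "rounding down" remark in the Figure 2 caption, and the layer count of 28 is a hypothetical value chosen only for the example.

```python
import math

def solution_length(num_layers, num_matrices_per_layer, lora_rank):
    """Candidate-vector dimensionality, following the Figure 1 caption."""
    num_matrices = num_layers * num_matrices_per_layer
    return num_matrices * lora_rank * 2   # factor 2: singular values of A and B

def trainable_singular_values(lora_rank, alpha):
    """Assumed rule: round alpha * rank down to an integer count."""
    return math.floor(alpha * lora_rank)

# Hypothetical model: 28 layers, 4 adapted matrices per layer (W_Q, W_K, W_V, W_O).
print(solution_length(num_layers=28, num_matrices_per_layer=4, lora_rank=8))  # 1792

# LoRA rank 8 with alpha = 0.1 rounds down to zero trainable singular values,
# matching the empty cell described in the Figure 2 caption.
print(trainable_singular_values(lora_rank=8, alpha=0.1))  # 0
```

Under these assumptions the search space stays in the low thousands of dimensions even with all attention projections adapted.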

Theorems & Definitions (4)

  • Lemma B.1: Optimal split
  • proof
  • Theorem B.2: ESSA iteration is faster under a conservative bound
  • proof