ESSA: Evolutionary Strategies for Scalable Alignment
Daria Korotyshova, Boris Shaposhnikov, Alexey Malakhov, Alexey Khokhulin, Nikita Surnachev, Kirill Ovcharenko, George Bredis, Alexey Gorbatovski, Viacheslav Sinii, Daniil Gavrilov
TL;DR
ESSA introduces a gradient-free approach to LLM alignment that, after a supervised fine-tuning (SFT) warm start, restricts optimization to the singular values of SVD-decomposed LoRA adapters. By running CMA-ES on this compact, low-rank subspace and operating in inference-only, quantized mode, it achieves competitive or superior alignment quality compared to gradient-based GRPO, while significantly reducing training complexity and wall-clock time. The method scales well across model sizes and hardware, is robust to hyperparameters, and parallelizes favorably, making it a practical alternative for large-scale alignment. The combination of SVD-LoRA parameterization, forward-only evaluation, and low communication overhead offers a compelling route to scalable, hardware-friendly LLM alignment, with caveats related to SFT dependence and fixed-rank limitations.
Abstract
Alignment of Large Language Models (LLMs) typically relies on Reinforcement Learning from Human Feedback (RLHF) with gradient-based optimizers such as Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO). While effective, these methods require complex distributed training, large memory budgets, and careful hyperparameter tuning, all of which become increasingly difficult at billion-parameter scale. We present ESSA, Evolutionary Strategies for Scalable Alignment, a gradient-free framework that aligns LLMs using only forward inference and black-box optimization. ESSA focuses optimization on Low-Rank Adapters (LoRA) and further compresses their parameter space by optimizing only the singular values from a singular value decomposition (SVD) of each adapter matrix. This dimensionality reduction makes evolutionary search practical even for very large models and allows efficient operation in quantized INT4 and INT8 inference modes. Across benchmarks, ESSA improves the test accuracy of Qwen2.5-Math-7B by 12.6% on GSM8K and 14.8% on PRM800K, and raises the accuracy of LLaMA3.1-8B on IFEval by 22.5%, all compared with GRPO. In large-scale settings, ESSA shows stronger scaling than gradient-based methods: on Qwen2.5-32B for PRM800K, it reaches near-optimal accuracy twice as fast on 16 GPUs and six times as fast on 128 GPUs compared with GRPO. These results position evolutionary strategies as a compelling, hardware-friendly alternative to gradient-based LLM alignment, combining competitive quality with substantially reduced wall-clock time and engineering overhead.
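To make the parameterization concrete, the following is a minimal toy sketch of the idea, not the authors' implementation. It factors two hypothetical LoRA matrices `A` and `B` with an SVD, keeps the singular vectors fixed, and searches only over the singular values using forward evaluations of a fitness function. Two things are assumptions made for illustration: the search loop is a simple elite-averaging evolution strategy standing in for CMA-ES, and the fitness is a synthetic reconstruction score rather than a reward computed from model outputs. All names (`svd_factor`, `rebuild`, `simple_es`, `fitness`) are hypothetical.

```python
import numpy as np


def svd_factor(M):
    """Return U, s, Vt such that M == (U * s) @ Vt."""
    return np.linalg.svd(M, full_matrices=False)


def rebuild(U, s, Vt):
    """Reassemble a matrix from fixed singular vectors and candidate singular values."""
    return (U * s) @ Vt


def simple_es(fitness, s0, sigma=0.1, popsize=16, elite=4, iters=50, seed=0):
    """Simplified evolution strategy (NOT full CMA-ES): sample around a mean,
    score each candidate with forward-only evaluations, average the elites."""
    rng = np.random.default_rng(seed)
    mean = s0.copy()
    best_s, best_f = s0.copy(), fitness(s0)
    for _ in range(iters):
        cands = mean + sigma * rng.standard_normal((popsize, mean.size))
        scores = np.array([fitness(c) for c in cands])
        top = np.argsort(scores)[-elite:]      # indices of the highest scores
        mean = cands[top].mean(axis=0)         # move the mean toward the elites
        if scores[top[-1]] > best_f:           # track the best candidate seen
            best_f = float(scores[top[-1]])
            best_s = cands[top[-1]].copy()
    return best_s, best_f


# Hypothetical toy adapter: hidden size d = 8, LoRA rank r = 4.
rng = np.random.default_rng(0)
d, r = 8, 4
A = 0.1 * rng.standard_normal((r, d))   # "down" projection
B = 0.1 * rng.standard_normal((d, r))   # "up" projection
Ua, sa, Vta = svd_factor(A)
Ub, sb, Vtb = svd_factor(B)

# Synthetic target update, reachable by rescaling the singular values only.
target = rebuild(Ub, 1.5 * sb, Vtb) @ rebuild(Ua, 0.5 * sa, Vta)


def fitness(s):
    """Stand-in for a reward: higher is better, computed without gradients."""
    s_b, s_a = s[:r], s[r:]
    delta = rebuild(Ub, s_b, Vtb) @ rebuild(Ua, s_a, Vta)
    return -float(np.sum((delta - target) ** 2))


s0 = np.concatenate([sb, sa])           # only 2 * r numbers are searched
f0 = fitness(s0)
best_s, best_f = simple_es(fitness, s0)
```

In this toy setup the search space has 2r = 8 dimensions, compared with the 2dr = 64 entries of `A` and `B`, which illustrates why restricting the search to singular values keeps evolutionary methods tractable. In a real setting, `fitness` would presumably be replaced by a reward computed from generations of the (possibly quantized) model.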
