EA4LLM: A Gradient-Free Approach to Large Language Model Optimization via Evolutionary Algorithms
WenTao Liu, Siyu Song, Hao Hao, Aimin Zhou
TL;DR
Gradient-based LLM optimization demands substantial hardware and requires every operation to be differentiable, which limits accessibility. EA4LLM introduces a gradient-free Evolution Strategies framework that performs full-parameter pretraining on models from 0.5B to 32B parameters by tying fitness to token-level log-probabilities and using subsampled evaluation for efficiency. Empirical results show stable loss reduction and more than 4x memory savings for models of 4B parameters and larger compared with gradient-based methods, with ablations highlighting the roles of evaluation budget and population size. This work broadens the optimization toolkit for LLMs, enabling lower-cost participation in deep learning research while challenging the notion that gradients are strictly necessary for high-performance neural networks.
Abstract
In recent years, large language models (LLMs) have made remarkable progress, with model optimization relying primarily on gradient-based optimizers such as Adam. However, these gradient-based methods impose stringent hardware requirements, demanding high-concurrency, high-memory GPUs. Moreover, they require all neural network operations to be differentiable, which excludes many promising non-differentiable architectures from practical use. To address these limitations, we propose EA4LLM, an evolutionary algorithm for optimizing LLMs, and, for the first time, empirically verify full-parameter optimization from the pretraining stage across model sizes ranging from 0.5B to 32B parameters. We conduct extensive experiments and provide key insights into how evolutionary algorithms can effectively optimize neural networks. Our work challenges the prevailing assumption that gradient-based optimization is the only viable approach for training neural networks. It also holds significant potential to reduce the computational cost of training large language models, thereby enabling groups with limited computational resources to participate in deep learning research.
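To make the idea of a log-probability fitness with subsampled evaluation concrete, the following is a minimal sketch of one Evolution Strategies loop on a toy problem. It is not the EA4LLM implementation: the bigram "model", the Gaussian perturbation scheme, the z-score fitness normalization, the update rule, and all hyperparameters (`pop`, `sigma`, `lr`, `sample_len`) are assumptions chosen for illustration. Only two ingredients are taken from the text above: fitness is the token-level log-probability of the data, and each generation scores candidates on a small subsample rather than on the full corpus.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def fitness(theta, tokens, vocab):
    # Fitness = mean log-probability of each next token under a toy
    # bigram model (assumption: stands in for a real LLM forward pass).
    logp = log_softmax(theta.reshape(vocab, vocab))  # row = previous token
    return logp[tokens[:-1], tokens[1:]].mean()

def es_step(theta, corpus, vocab, rng, pop=32, sigma=0.1, lr=0.2, sample_len=64):
    # Subsampled evaluation: one random window is shared by the whole
    # population, so every candidate is scored on the same tokens.
    start = rng.integers(0, len(corpus) - sample_len)
    window = corpus[start:start + sample_len]
    # Sample Gaussian perturbations and score each perturbed parameter vector.
    eps = rng.standard_normal((pop, theta.size))
    scores = np.array([fitness(theta + sigma * e, window, vocab) for e in eps])
    # Z-score the fitness values (assumed normalization), then move theta
    # along the fitness-weighted average of the perturbations.
    adv = (scores - scores.mean()) / (scores.std() + 1e-8)
    return theta + lr / (pop * sigma) * adv @ eps

def train(steps=200, vocab=8, seed=0):
    rng = np.random.default_rng(seed)
    corpus = np.arange(2000) % vocab      # toy data: 0, 1, ..., 7, 0, 1, ...
    theta = np.zeros(vocab * vocab)       # uniform next-token distribution
    before = -fitness(theta, corpus, vocab)
    for _ in range(steps):
        theta = es_step(theta, corpus, vocab, rng)
    after = -fitness(theta, corpus, vocab)
    return before, after                  # negative log-likelihood per token

if __name__ == "__main__":
    before, after = train()
    print(f"loss before: {before:.3f}, loss after: {after:.3f}")
```

The loop uses only forward evaluations of `fitness`: no backward pass is run, so no activations are stored for backpropagation and no per-parameter optimizer state is kept. The shared evaluation window is a design choice in this sketch so that fitness differences reflect the perturbations rather than differences in the sampled data.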
