CMA-ES for Hyperparameter Optimization of Deep Neural Networks

Ilya Loshchilov, Frank Hutter

TL;DR

The paper investigates the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) as a derivative-free, parallel-friendly alternative to Bayesian optimization for tuning the hyperparameters of deep neural networks. It benchmarks CMA-ES against Gaussian-process-based methods (Spearmint with EI and PES) and tree-based Bayesian optimizers (TPE, SMAC) on MNIST, using 30 GPUs. Results show that CMA-ES steadily improves validation performance, often reaching validation errors below 0.4% with large parallel budgets, while GP-based methods incur higher wall-clock costs because their cost grows cubically with the number of evaluations. The study suggests CMA-ES as a competitive component in hyperparameter optimization, especially in highly parallel regimes, and provides releasable code and supplementary material for reproducibility.

Abstract

Hyperparameters of deep neural networks are often optimized by grid search, random search or Bayesian optimization. As an alternative, we propose to use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES), which is known for its state-of-the-art performance in derivative-free optimization. CMA-ES has some useful invariance properties and is friendly to parallel evaluations of solutions. We provide a toy example comparing CMA-ES and state-of-the-art Bayesian optimization algorithms for tuning the hyperparameters of a convolutional neural network for the MNIST dataset on 30 GPUs in parallel.
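The "friendly to parallel evaluations" property can be illustrated with a minimal sketch. The code below is not CMA-ES and not the authors' code: it is a simplified evolution strategy with weighted recombination, an isotropic Gaussian (no covariance adaptation) and a crude step-size rule chosen here for brevity. The objective `toy_validation_error`, the function `simple_es` and all parameter values are hypothetical stand-ins; in a real setting the objective would train a network and return its validation error.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def toy_validation_error(x):
    """Hypothetical stand-in for 'train a network, return validation error'."""
    # Smooth bowl with its minimum at (0.5, -1.0); not taken from the paper.
    return (x[0] - 0.5) ** 2 + 10.0 * (x[1] + 1.0) ** 2


def simple_es(objective, mean, sigma, popsize=30, generations=40, workers=30, seed=0):
    """Simplified evolution strategy (assumption: isotropic, no covariance update)."""
    rng = random.Random(seed)
    mu = popsize // 2
    # Log-decreasing recombination weights over the best mu candidates.
    raw = [math.log(mu + 0.5) - math.log(i + 1) for i in range(mu)]
    weights = [w / sum(raw) for w in raw]
    best_x, best_f = list(mean), objective(mean)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Sample a whole population around the current mean.
            population = [
                [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                for _ in range(popsize)
            ]
            # Evaluate all candidates concurrently (think: one per GPU).
            fitness = list(pool.map(objective, population))
            # Only the ranking of candidates is used, not the raw values.
            ranked = sorted(zip(fitness, population), key=lambda p: p[0])
            # Crude step-size rule (assumption): grow on progress, else shrink.
            if ranked[0][0] < best_f:
                best_f, best_x = ranked[0][0], list(ranked[0][1])
                sigma *= 1.1
            else:
                sigma *= 0.8
            # New mean is the weighted average of the best mu candidates.
            mean = [
                sum(w * x[d] for w, (_, x) in zip(weights, ranked))
                for d in range(len(mean))
            ]
    return best_x, best_f


start = [3.0, 3.0]
best_x, best_f = simple_es(toy_validation_error, mean=start, sigma=1.0)
```

Each generation needs `popsize` independent evaluations, which is why a population of 30 maps naturally onto 30 workers. Because only the ranking of candidates enters the update, the search is unchanged by any strictly increasing transformation of the objective, which is one of the invariance properties the abstract alludes to.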

Paper Structure

This paper contains 1 section, 5 figures, 1 table.

Table of Contents

  1. Supplementary Material

Figures (5)

  • Figure 1: Best validation errors found for AdaDelta and Adam, with and without batch selection, when hyperparameters are optimized by CMA-ES with training time budgets of 5 minutes (left) and 30 minutes (right).
  • Figure 2: Comparison of optimizers for Adam with batch selection when solutions are evaluated sequentially for 5 minutes each (left), and in parallel for 30 minutes each (right). Note that the red dots for CMA-ES were plotted first and are in the background of the figure (see also Figure 4 in the supplementary material for an alternative representation of the results).
  • Figure 3: Likelihoods of hyperparameter values to appear in the first 30 evaluations (dotted lines) and last 100 evaluations (bold lines) out of 1000 for CMA-ES and TPE with Gaussian priors during hyperparameter optimization on the MNIST dataset. We used kernel density estimation via diffusion (Botev et al., 2010) with 256 mesh points.
  • Figure 4: Likelihoods of validation errors on MNIST found by different algorithms, as estimated from all evaluated solutions with the kernel density estimator of Botev et al. (2010) using 5000 mesh points. Since the estimator does not fit the outliers near 90% error well, the legend additionally reports the percentage of cases in which the validation error was greater than 70% (i.e., divergence or near-divergence).
  • Figure 5: Preliminary results not discussed in the main paper. Validation errors on CIFAR-10 found by Adam when hyperparameters are optimized by CMA-ES and TPE with Gaussian priors with training time budgets of 60 and 120 minutes. No data augmentation is used; only ZCA whitening is applied. Hyperparameter ranges differ from the ones given in Table 1 because the network structure is different (it is deeper).
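
The density plots described in Figures 3 and 4 can be approximated with a short sketch. The code below uses a plain Gaussian kernel density estimate with a normal-reference rule-of-thumb bandwidth, which is a simpler substitute for the diffusion-based estimator cited in the captions, and it reports the share of runs above 70% error separately, as the Figure 4 legend does. The helper names and the `errors` list are hypothetical; the values are made up for illustration and are not results from the paper.

```python
import math
import statistics


def gaussian_kde_on_mesh(samples, lo, hi, mesh_points=256):
    """Plain Gaussian KDE evaluated on an evenly spaced mesh."""
    n = len(samples)
    # Normal-reference rule-of-thumb bandwidth (assumption: the diffusion
    # estimator cited in the captions chooses its bandwidth differently).
    h = 1.06 * statistics.stdev(samples) * n ** (-1 / 5)
    step = (hi - lo) / (mesh_points - 1)
    mesh = [lo + i * step for i in range(mesh_points)]
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    density = [
        norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)
        for x in mesh
    ]
    return mesh, density


def share_above(samples, threshold):
    """Fraction of samples strictly greater than the threshold."""
    return sum(1 for s in samples if s > threshold) / len(samples)


# Made-up validation errors in percent, for illustration only.
errors = [0.45, 0.38, 0.52, 0.41, 0.36, 0.60, 0.39, 90.2, 0.44, 0.37]

# Fit the density only to runs that did not diverge, and report the
# diverged share separately.
converged = [e for e in errors if e <= 70.0]
mesh, density = gaussian_kde_on_mesh(converged, lo=0.0, hi=1.0)
diverged_share = share_above(errors, 70.0)
```

Separating the diverged runs keeps a few extreme values near 90% error from dominating the bandwidth and flattening the part of the density where the algorithms actually differ.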