
CoRe Optimizer: An All-in-One Solution for Machine Learning

Marco Eckhoff, Markus Reiher

TL;DR

This work provides an extensive performance comparison of the CoRe optimizer and nine other optimization algorithms including the Adam optimizer and resilient backpropagation (RPROP) for diverse ML tasks.

Abstract

The optimization algorithm and its hyperparameters can significantly affect the training speed and resulting model accuracy in machine learning applications. The wish list for an ideal optimizer includes fast and smooth convergence to low error, low computational demand, and general applicability. Our recently introduced continual resilient (CoRe) optimizer has shown superior performance compared to other state-of-the-art first-order gradient-based optimizers for training lifelong machine learning potentials. In this work we provide an extensive performance comparison of the CoRe optimizer and nine other optimization algorithms including the Adam optimizer and resilient backpropagation (RPROP) for diverse machine learning tasks. We analyze the influence of different hyperparameters and provide generally applicable values. The CoRe optimizer yields best or competitive performance in every investigated application, while only one hyperparameter needs to be changed depending on mini-batch or batch learning.

Paper Structure

This paper contains 19 sections, 21 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Bar chart of the final accuracy scores $A_i$ (Equation (\ref{eq:A_i})) of various ML tasks $i$ trained by different optimizers. The uncertainty interval (Equation (\ref{eq:Delta_A_i})) is shown as a cross-hatched bar around the upper edge of each bar; the upper edge marks the mean of $A_i$ over 20 trainings (100 for NR and RA). $100\%$ corresponds to the highest final accuracy obtained among all optimizer specifications. The acronyms of the ML tasks are explained in Table \ref{tab:ML_tasks}. The optimizers are listed in the legends and are represented by different colors. The learning rate (maximal step size for RPROP, RPROP$^*$, and CoRe) was adjusted, while all other hyperparameters of the optimizers were set to (a) their generally recommended values and (b) modified values. AdaGrad and SGD are exceptions because they have no hyperparameters beyond the learning rate. The CoRe results are shown as a reference in (b).
  • Figure 2: Bar chart of the final accuracy score $\overline{A}$ (Equation (\ref{eq:A})) averaged over all ML tasks shown in Figures \ref{fig:final_accuracy} (a) and (b) for different optimizers. The uncertainty interval (Equation (\ref{eq:Delta_A})) is shown as a cross-hatched bar around the right edge of each bar; the right edge marks the value of $\overline{A}$. A value of $100\%$ means that the optimizer achieves the highest accuracy in every ML task. The bars are labeled and colored according to the respective optimizer.
  • Figure 3: Bar chart of the final accuracy scores $A_i$ (Equation (\ref{eq:A_i})) of energy and force prediction of lMLPs trained by different optimizers. The uncertainty interval (Equation (\ref{eq:Delta_A_i})) is shown as a cross-hatched bar around the upper edge of each bar; the upper edge marks the mean of $A_i$ over 20 trainings. $100\%$ corresponds to the highest final accuracy obtained among all optimizer specifications, i.e., the lowest RMSE in the prediction of energies or atomic force components. The colors of most optimizers are listed in the legends of Figures \ref{fig:final_accuracy} (a) and (b). The learning rate (maximal step size for RPROP and CoRe specifications) was adjusted, while all other hyperparameters of the optimizers were set to (a) their generally recommended values and (b) modified values. AdaGrad and SGD are exceptions because they have no hyperparameters beyond the learning rate. The CoRe results are shown as a reference in (b).
  • Figure 4: Bar chart of the final accuracy score $\overline{A}$ (Equation (\ref{eq:A})) combining energy and force prediction of lMLPs for different optimizers. The uncertainty interval (Equation (\ref{eq:Delta_A})) is shown as a cross-hatched bar around the right edge of each bar; the right edge marks the value of $\overline{A}$. A value of $100\%$ means that the optimizer achieves the highest accuracy in both energy and force prediction. The bars are labeled and colored according to the respective optimizer.
  • Figure 5: Test set RMSEs of (a) energy $E^\mathrm{test}$ and (b) atomic force components $F_{\alpha,n}^\mathrm{test}$ as a function of the training epoch $n_\mathrm{epoch}$ for the lMLP compared to the DFT reference. The results are shown for the eight optimizers yielding the highest final accuracy. The less often a line is broken, the lower the final error. Uncertainty intervals are shown in a pale shade of the respective line's color.
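
The captions of Figures 1 and 2 describe scores that are normalized so that $100\%$ corresponds to the best result among all optimizers, averaged first over repeated trainings and then over tasks. The sketch below illustrates that kind of normalization in plain Python. It is not the paper's definition: the exact forms of $A_i$, $\overline{A}$, and their uncertainty intervals are given by equations not reproduced here, so the function names `relative_scores` and `average_score`, the use of the sample standard deviation as the spread, and the toy numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev


def relative_scores(final_acc):
    """Normalize mean final accuracies so the best optimizer scores 100%.

    final_acc maps a (hypothetical) optimizer name to a list of final
    accuracies from repeated trainings (at least two runs per optimizer).
    Returns optimizer -> (score in percent, spread in percent).
    The spread is the sample standard deviation, an assumed stand-in
    for the paper's uncertainty interval.
    """
    means = {opt: mean(runs) for opt, runs in final_acc.items()}
    best = max(means.values())  # reference value that defines 100%
    return {
        opt: (100.0 * m / best, 100.0 * stdev(final_acc[opt]) / best)
        for opt, m in means.items()
    }


def average_score(per_task_scores):
    """Average the per-task scores of each optimizer over all tasks.

    per_task_scores is a list with one relative_scores() result per task.
    """
    optimizers = per_task_scores[0].keys()
    return {
        opt: mean(task[opt][0] for task in per_task_scores)
        for opt in optimizers
    }


if __name__ == "__main__":
    # Toy numbers, not results from the paper.
    task_1 = relative_scores({"opt_a": [0.90, 0.92], "opt_b": [0.80, 0.84]})
    task_2 = relative_scores({"opt_a": [0.70, 0.74], "opt_b": [0.75, 0.77]})
    print(average_score([task_1, task_2]))
```

Under this normalization an optimizer reaches an average of $100\%$ only if it is the best one in every task, which matches the reading given in the caption of Figure 2. For error metrics such as the RMSEs in Figure 3, where lower is better, the reference would have to be the lowest value rather than the highest; that case is not covered by this sketch.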