Neural Optimizer Equation, Decay Function, and Learning Rate Schedule Joint Evolution

Brandon Morgan; Dean Hougen

TL;DR

This paper addresses optimizer selection by introducing a dual-joint neural optimizer search (NOS) space that evolves the weight-update equation together with its internal decay functions and learning-rate schedule. It combines a NASNet-style graph representation with a mutation-only, particle-based genetic algorithm, an integrity check, and a surrogate evaluator to explore the search space efficiently, followed by a progressive optimizer-elimination protocol that tests transferability to large models. The authors report several optimizers, learning-rate schedules, and Adam variants that outperform Adam and other standard optimizers on CIFAR-10/100, TinyImageNet, and fine-tuning tasks. The results suggest that jointly optimizing the update rule with its learning-rate and decay mechanisms yields transferable, high-performing optimizers, which is relevant to automated optimizer discovery and large-scale training.

Abstract

A major contributor to the quality of a deep learning model is the selection of the optimizer. We propose a new dual-joint search space in the realm of neural optimizer search (NOS), along with an integrity check, to automate the process of finding deep learning optimizers. Our dual-joint search space allows simultaneous optimization of not only the update equation but also the internal decay functions and learning rate schedules of optimizers. We search the space using our proposed mutation-only, particle-based genetic algorithm, which can be massively parallelized for our domain-specific problem. We evaluate our candidate optimizers on the CIFAR-10 dataset using a small ConvNet. To assess generalization, the final optimizers were then transferred to large-scale image classification on CIFAR-100 and TinyImageNet, and were also used for fine-tuning on Flowers102, Cars196, and Caltech101, using EfficientNetV2Small. We found multiple optimizers, learning rate schedules, and Adam variants that outperformed Adam, as well as other standard deep learning optimizers, across the image classification tasks.

Paper Structure (24 sections, 1 equation, 15 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Methodology
Search Space
Optimizer
Decay Function
Integrity Check
Surrogate Function
Early Stopping
Particle-Based Genetic Algorithm
Optimizer Elimination Protocol
Adam Variants
Results
Final Optimizers
Supplementary Experiments
...and 9 more sections

Figures (15)

Figure 1: Example optimizer graph with two active (blue) hidden state nodes, two inactive hidden state nodes (grey), and one root node (white). The final weight update equation is given above the root node. Note that this does not include momentum type.
Figure 2: Example decay function applied to the $10^{-4}w$ operand before that operand is passed to the $\ln(|x|)$ operation. The decay graph contains one active (blue) hidden state node, zero inactive (grey) hidden state nodes, and one root node (white). The final decay function equation is given above the root node.
Figure 3: Learning Rate Family 1
Figure 4: Learning Rate Family 2
Figure 5: Outsider Learning Rates
...and 10 more figures
