Automatic Gradient Descent: Deep Learning without Hyperparameters

Jeremy Bernstein; Chris Mingard; Kevin Huang; Navid Azizan; Yisong Yue

TL;DR

The paper introduces automatic gradient descent (AGD), a hyperparameter-free optimiser that explicitly accounts for neural architecture by embedding it in a majorise-minimise framework via functional expansion and deep relative trust. By combining Bregman divergences with architecture-aware perturbation bounds, AGD derives layer-wise updates that scale with network depth and width, and reduces hyperparameter tuning to a single gain parameter. Theoretical guarantees include a convergence rate to critical points and, under a Polyak-Łojasiewicz-type condition, a convergence bound to global minima. Empirically, AGD trains deep fully-connected and convolutional networks, attains performance comparable to the best tuned Adam and SGD for ResNet-18 on CIFAR-10, and reaches 65.5% top-1 test accuracy on ImageNet with ResNet-50, all without hyperparameter tuning. The work offers a rigorous, architecture-driven optimisation paradigm that could reduce computational cost and improve reproducibility in large-scale deep learning.

Abstract

The architecture of a deep neural network is defined explicitly in terms of the number of layers, the width of each layer and the general network topology. Existing optimisation frameworks neglect this information in favour of implicit architectural information (e.g. second-order methods) or architecture-agnostic distance functions (e.g. mirror descent). Meanwhile, the most popular optimiser in practice, Adam, is based on heuristics. This paper builds a new framework for deriving optimisation algorithms that explicitly leverage neural architecture. The theory extends mirror descent to non-convex composite objective functions: the idea is to transform a Bregman divergence to account for the non-linear structure of neural architecture. Working through the details for deep fully-connected networks yields automatic gradient descent: a first-order optimiser without any hyperparameters. Automatic gradient descent trains both fully-connected and convolutional networks out-of-the-box and at ImageNet scale. A PyTorch implementation is available at https://github.com/jxbz/agd and also in Appendix B. Overall, the paper supplies a rigorous theoretical foundation for a next generation of architecture-dependent optimisers that work automatically and without hyperparameters.
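
The update rule itself is not stated on this page. To give a rough sense of what a first-order optimiser without hyperparameters can look like, the NumPy sketch below performs one update step for a stack of weight matrices. The rule it uses is an assumption based on the paper's Algorithm 1 for fully-connected networks and should be checked against the linked repository. The function name `agd_step` and the convention that each weight matrix has shape (d_out, d_in) are hypothetical choices made here.

```python
import math
import numpy as np


def agd_step(weights, grads):
    """One assumed AGD-style update for a list of (d_out, d_in) weight matrices.

    Assumption (not stated on this page): the step follows
        G   = (1/L) * sum_k sqrt(d_out_k / d_in_k) * ||grad_k||_F
        eta = log((1 + sqrt(1 + 4 G)) / 2)
        W_k <- W_k - (eta / L) * sqrt(d_out_k / d_in_k) * grad_k / ||grad_k||_F
    where L is the number of layers.
    """
    depth = len(weights)
    # Width-dependent scale of each layer, read off the matrix shape.
    scales = [math.sqrt(w.shape[0] / w.shape[1]) for w in weights]
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    grad_norms = [float(np.linalg.norm(g)) for g in grads]

    # Gradient summary: scaled gradient norms averaged over layers.
    G = sum(s * n for s, n in zip(scales, grad_norms)) / depth
    # Automatic learning rate; equals zero when every gradient is zero.
    eta = math.log(0.5 * (1.0 + math.sqrt(1.0 + 4.0 * G)))

    new_weights = []
    for w, g, s, n in zip(weights, grads, scales, grad_norms):
        if n == 0.0:
            # Nothing to normalise, so leave this layer untouched.
            new_weights.append(w.copy())
        else:
            new_weights.append(w - (eta / depth) * s * g / n)
    return new_weights, eta
```

Under this assumed rule, no learning rate is passed in: the step size `eta` is computed from the gradient norms, the layer shapes and the depth alone.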

Paper Structure (21 sections, 28 theorems, 54 equations, 7 figures, 2 tables, 1 algorithm)

Introduction
Summary of contributions
Related work
Optimisation theory
Deep learning theory
Preliminaries
Majorise-Minimise for Generic Learning Problems
Decomposition of linearisation error
Functional expansion and functional majorisation
Recovering existing frameworks
Mirror descent
Gauss-Newton method
Natural gradient descent
Majorise-Minimise for Deep Learning Problems
Deriving automatic gradient descent
...and 6 more sections

Key Result

Proposition 1

For any differentiable loss $\ell$ and any differentiable machine learning model ${\bm{f}}$ the linearisation error of the objective function $\mathcal{L}$ admits the following decomposition:
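
The displayed equation did not survive on this page. As a sketch of the kind of identity the statement refers to, the chain rule gives the decomposition below, assuming the objective is the average loss over a training set. The symbols ${\bm{w}}$ (weights), $\mathsf{S}$ (training set) and $D_\ell$ (linearisation error of the loss in its first argument, which is a Bregman divergence when $\ell$ is convex) are notation assumed here, and the paper's exact statement may be arranged differently.

```latex
\begin{align*}
\mathcal{L}({\bm{w}})
  &= \frac{1}{|\mathsf{S}|} \sum_{({\bm{x}},{\bm{y}}) \in \mathsf{S}}
     \ell\big({\bm{f}}({\bm{x}};{\bm{w}}), {\bm{y}}\big), \\
D_\ell({\bm{f}}, {\bm{f}} + \Delta{\bm{f}}; {\bm{y}})
  &:= \ell({\bm{f}} + \Delta{\bm{f}}, {\bm{y}}) - \ell({\bm{f}}, {\bm{y}})
     - \nabla_{{\bm{f}}} \ell({\bm{f}}, {\bm{y}})^\top \Delta{\bm{f}}, \\
\underbrace{\Delta\mathcal{L}({\bm{w}})
     - \nabla_{{\bm{w}}} \mathcal{L}({\bm{w}})^\top \Delta{\bm{w}}}_{\text{linearisation error of } \mathcal{L}}
  &= \frac{1}{|\mathsf{S}|} \sum_{({\bm{x}},{\bm{y}}) \in \mathsf{S}}
     \Big[ \nabla_{{\bm{f}}} \ell({\bm{f}}, {\bm{y}})^\top
       \underbrace{\big( \Delta{\bm{f}} - \nabla_{{\bm{w}}} {\bm{f}} \, \Delta{\bm{w}} \big)}_{\text{linearisation error of } {\bm{f}}}
     + D_\ell({\bm{f}}, {\bm{f}} + \Delta{\bm{f}}; {\bm{y}}) \Big].
\end{align*}
```

Here ${\bm{f}}$ abbreviates ${\bm{f}}({\bm{x}};{\bm{w}})$ and $\Delta{\bm{f}}$ is the change in model output induced by the weight perturbation $\Delta{\bm{w}}$. The last line follows by adding and subtracting $\nabla_{{\bm{f}}} \ell^\top \Delta{\bm{f}}$ inside each summand and using $\nabla_{{\bm{w}}} \mathcal{L}^\top \Delta{\bm{w}} = \frac{1}{|\mathsf{S}|} \sum \nabla_{{\bm{f}}} \ell^\top \nabla_{{\bm{w}}} {\bm{f}} \, \Delta{\bm{w}}$.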

Figures (7)

Figure 1: Automatic gradient descent trains neural networks reliably without hyperparameters. Solid lines show train accuracy and dotted lines show test accuracy. The networks are unregularised with biases and affine parameters disabled, as these features are not yet supported by AGD. In the left panel, Adam and SGD, unlike AGD, failed to train a 32-layer fully-connected network on CIFAR-10 with their default learning rates of 0.001 for Adam and 0.1 for SGD. The middle panel displays a learning rate grid search for ResNet-18 trained on CIFAR-10. AGD attained performance comparable to the best tuned performance of Adam and SGD. In the right panel, AGD trained ResNet-50 on ImageNet to a top-1 test accuracy of 65.5%. The ImageNet baseline is SGD with a learning rate of 0.1 and no learning rate decay schedule.
Figure 2: Majorise-minimise and the perturbation hierarchy. The left panel depicts the majorise-minimise meta-algorithm [mm], which is an algorithmic pattern for reducing an objective (blue) by minimising a sequence of upper bounds (one shown in red). Each upper bound, known as a majorisation, must lie tangent to the objective to guarantee an improvement in one step of the meta-algorithm. The right panel depicts the perturbation hierarchy of a generic machine learning model: the optimiser perturbs the weights and this induces perturbations to the model output, the loss on individual training examples and ultimately the overall objective. Majorising machine learning objective functions requires addressing the full perturbation hierarchy.
Figure 3: Perturbation hierarchy of a deep neural network. When training a neural network, the optimiser applies structured perturbations to the weights, in the form of one perturbation matrix $\Delta {\bm{W}}_k$ per weight matrix ${\bm{W}}_k$. Deep relative trust [my-fromage] provides a tool to understand how structured weight perturbations of this form affect the network output ${\bm{f}}$. Combining deep relative trust with a Bregman divergence [bregman1967relaxation] allows us to analyse the full perturbation hierarchy.
Figure 4: Benchmarking automatic gradient descent on a range of architectures and datasets. Solid lines are AGD and faint dashed lines are tuned Adam, except for ImageNet where the dashed line is SGD with a fixed learning rate of 0.1. ImageNet used cross-entropy loss with a mini-batch size of 1024. The other experiments used square loss with a mini-batch size of 128. The top row plots the automatic learning rate ($\eta$ in the main text) and objective value. For the first three plots, the maximum and minimum learning rates for each epoch are included in addition to the mean. The bottom row shows the train and test accuracy.
Figure 5: Comparing automatic gradient descent to tuned Adam and SGD. An eight-layer fully-connected network was trained on CIFAR-10 with square loss. Dotted lines show test and solid lines show train performance. The left panel shows the objective value: AGD and Adam attained a smaller training objective than SGD. The middle panel shows train and test accuracies. The right panel shows the relative update size averaged over layers: $\tfrac{1}{L}\sum_{k=1}^L \Vert {\Delta {\bm{W}}_k} \Vert_F/\Vert {{\bm{W}}_k} \Vert_F$. We plot the maximum, minimum and mean over an epoch.
...and 2 more figures
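
The relative update size shown in the right panel of Figure 5, $\tfrac{1}{L}\sum_{k=1}^L \Vert \Delta {\bm{W}}_k \Vert_F/\Vert {\bm{W}}_k \Vert_F$, is straightforward to compute from the weights and their updates. The NumPy sketch below is one way to do so; the function name `mean_relative_update` is a hypothetical choice, and the sketch assumes every weight matrix is non-zero.

```python
import numpy as np


def mean_relative_update(weights, updates):
    """Average over layers of ||dW_k||_F / ||W_k||_F.

    `weights` and `updates` are equal-length lists of 2-D arrays, one pair per
    layer. Assumes no weight matrix is identically zero.
    """
    if len(weights) != len(updates):
        raise ValueError("need exactly one update per weight matrix")
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    ratios = [np.linalg.norm(dW) / np.linalg.norm(W)
              for W, dW in zip(weights, updates)]
    return float(np.mean(ratios))
```

Tracking the maximum, minimum and mean of this quantity over the steps in an epoch would give the three curves described in the caption.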

Theorems & Definitions (55)

Definition 1: Manhattan norm
Definition 2: Euclidean norm
Definition 3: Infinity norm
Definition 4: Frobenius norm
Definition 5: Operator norm
Definition 6: Rank
Definition 7: Stable rank
Definition 8: Composite objective
Example 1: Square loss
Example 2: Xent loss
...and 45 more
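
Only the names of these definitions appear on this page. As a quick reference, the sketch below computes Definitions 1 to 7 with NumPy, assuming the standard meanings: the operator norm is taken to be the largest singular value, and the stable rank is taken to be the squared Frobenius norm divided by the squared operator norm. The function names are hypothetical, and the paper's exact conventions may differ.

```python
import numpy as np


def manhattan_norm(v):
    """Sum of absolute entries of a vector."""
    return float(np.sum(np.abs(v)))


def euclidean_norm(v):
    """Square root of the sum of squared entries of a vector."""
    return float(np.sqrt(np.sum(np.square(v))))


def infinity_norm(v):
    """Largest absolute entry of a vector."""
    return float(np.max(np.abs(v)))


def frobenius_norm(W):
    """Square root of the sum of squared entries of a matrix."""
    return float(np.linalg.norm(W, "fro"))


def operator_norm(W):
    """Largest singular value of a matrix (assumed definition)."""
    return float(np.linalg.norm(W, 2))


def rank(W):
    """Numerical rank, i.e. the number of non-negligible singular values."""
    return int(np.linalg.matrix_rank(W))


def stable_rank(W):
    """Squared Frobenius norm over squared operator norm (assumed definition)."""
    return frobenius_norm(W) ** 2 / operator_norm(W) ** 2
```

With these definitions the stable rank never exceeds the rank, since the squared Frobenius norm is the sum of squared singular values and each is at most the largest one.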
