QLABGrad: a Hyperparameter-Free and Convergence-Guaranteed Scheme for Deep Learning

Minghan Fu, Fang-Xiang Wu

TL;DR

This study proposes QLABGrad, a novel learning rate adaptation scheme that automatically determines the learning rate by optimizing the quadratic loss approximation-based (QLAB) function along a given gradient descent direction, at the cost of only one extra forward propagation.

Abstract

The learning rate is a critical hyperparameter for deep learning tasks since it determines how far the model parameters are updated at each step of training. However, the choice of learning rate typically depends on empirical judgment, which may not yield satisfactory outcomes without intensive trial-and-error experiments. In this study, we propose a novel learning rate adaptation scheme called QLABGrad. Without any user-specified hyperparameter, QLABGrad automatically determines the learning rate by optimizing the Quadratic Loss Approximation-Based (QLAB) function for a given gradient descent direction, where only one extra forward propagation is required. We theoretically prove the convergence of QLABGrad under a smooth Lipschitz condition on the loss function. Experimental results on multiple architectures, including MLP, CNN, and ResNet, on the MNIST, CIFAR10, and ImageNet datasets demonstrate that QLABGrad outperforms various competing schemes for deep learning.
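
The idea of choosing a learning rate from a quadratic approximation of the loss along the descent direction can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the quadratic is fitted to three quantities: the current loss, the slope along the negative gradient (minus the squared gradient norm), and the loss at one probe point, which stands in for the single extra forward propagation. The function names, the initial probe step of 0.01, the reuse of the previous learning rate as the next probe step, and the fallback for non-positive curvature are all assumptions made for illustration.

```python
import numpy as np


def booth(theta):
    """Booth function; its global minimum is 0 at (1, 3)."""
    x, y = theta
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2


def booth_grad(theta):
    """Analytic gradient of the Booth function."""
    x, y = theta
    r1 = x + 2 * y - 7
    r2 = 2 * x + y - 5
    return np.array([2 * r1 + 4 * r2, 4 * r1 + 2 * r2])


def quadratic_step_size(loss, probe_loss, grad_sq_norm, probe_lr):
    """Minimizer of the quadratic q(a) with q(0) = loss,
    q'(0) = -grad_sq_norm and q(probe_lr) = probe_loss."""
    # curvature equals (second-order coefficient of q) * probe_lr**2
    curvature = probe_loss - loss + probe_lr * grad_sq_norm
    if curvature <= 0:
        # Assumed fallback: the fitted quadratic has no minimum, keep the probe step.
        return probe_lr
    return probe_lr ** 2 * grad_sq_norm / (2 * curvature)


def minimize_booth(start=(0.0, 0.0), probe_lr=0.01, max_steps=200):
    """Gradient descent on the Booth function with the quadratic-fit step size."""
    theta = np.asarray(start, dtype=float)
    lr = probe_lr
    for _ in range(max_steps):
        grad = booth_grad(theta)
        grad_sq_norm = float(grad @ grad)
        if grad_sq_norm < 1e-16:
            break  # gradient is numerically zero
        loss = booth(theta)
        # The one extra loss evaluation, taken at a trial point along -grad.
        probe_loss = booth(theta - lr * grad)
        lr = quadratic_step_size(loss, probe_loss, grad_sq_norm, lr)
        theta = theta - lr * grad
    return theta


theta_final = minimize_booth()
print(theta_final, booth(theta_final))
```

Because the Booth function is itself quadratic, the fitted model is exact along every search direction, so in this toy case the sketch reduces to steepest descent with exact line search. On a neural network loss the fit would only be a local approximation.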

Paper Structure (22 sections, 21 equations, 8 figures, 2 tables, 2 algorithms)


Figures (8)

  • Figure 1: Illustration of QLAB.
  • Figure 2: Comparison of QLABGrad with common optimizers (SGD and Adam) on multimodal functions. The blue dot marks the initial position on the loss surface; the goal is to illustrate how QLABGrad operates. Compared to the other methods, QLABGrad moves more aggressively across the objective function surface, which allows it to reach a local minimum much faster. Specifically, QLABGrad reaches the optimal point on the Booth, Eggholder, and Himmelblau functions in only 50, 500, and 90 iterations, respectively. In contrast, SGD requires 200, 1500, and 220 iterations, while Adam needs 5000, 10000, and 5000 iterations.
  • Figure 3: Learning rate variations of HGD, L4GD, LQA, and QLABGrad when training the MLP (a) and CNN (b) models on the MNIST dataset and ResNet18 on the CIFAR-10 (c) and Tiny-ImageNet (d) datasets. The X-axis represents the number of iterations and the Y-axis indicates the learning rate values.
  • Figure 4: Training loss for MLP on the MNIST dataset. The X-axis represents the number of iterations and the Y-axis indicates the training loss. A zoomed-in view of the loss changes during the initial 10,000 iterations is shown in the top-right corner of each subplot, with the loss ranging from 0 to 0.2.
  • Figure 5: Training loss for CNN on the MNIST dataset, following the same settings as Figure 4. The overall loss in each subplot ranges from 0 to 1, with iterations spanning from 0 to 30,000. The zoomed-in view shows the loss changes across the last 10,000 iterations, with the loss ranging from 0 to 0.2.
  • ...and 3 more figures
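
For readers unfamiliar with the benchmark surfaces named in the Figure 2 caption, the two non-convex ones can be written down directly. The sketch below uses the standard textbook definitions of the Himmelblau and Eggholder functions. The caption does not state which domain, starting point, or exact variant the authors used, so treat these definitions and the reference minima in the comments as assumptions.

```python
import math


def himmelblau(x, y):
    """Himmelblau function: four global minima with value 0, one of them at (3, 2)."""
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2


def eggholder(x, y):
    """Eggholder function, usually studied on the square [-512, 512] x [-512, 512].
    Its commonly quoted global minimum is about -959.64 near (512, 404.23)."""
    shifted = y + 47
    return (-shifted * math.sin(math.sqrt(abs(x / 2 + shifted)))
            - x * math.sin(math.sqrt(abs(x - shifted))))


print(himmelblau(3.0, 2.0))
print(eggholder(512.0, 404.2319))
```

Both surfaces have many local minima, which is why a method's trajectory and iteration count on them depend strongly on the starting point and step size.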