Bolstering Stochastic Gradient Descent with Model Building

S. Ilker Birbil, Ozgur Martin, Gonenc Onay, Figen Oztoprak

TL;DR

This work proposes an alternative approach to stochastic line search by using a new algorithm based on forward step model building that achieves faster convergence and better generalization in well-known test problems, and shows comparable performance to other adaptive methods.

Abstract

The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when the algorithms are fine-tuned for the application at hand. Although this tuning process can incur large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the step length. We propose an alternative approach to stochastic line search, using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the step length but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide a convergence rate analysis and show experimentally that the proposed algorithm, SMB, achieves faster convergence and better generalization in well-known test problems. More precisely, SMB requires less tuning and shows comparable performance to other adaptive methods.

Paper Structure

This paper contains 7 sections, 35 equations, 6 figures, 1 table, 2 algorithms.

Figures (6)

  • Figure 1: An iteration of SMB on a simple quadratic function. We assume for simplicity that there is only one parameter group, and hence we drop the subscript $p$. The algorithm first computes the trial point $x_k^t$ by taking the (stochastic) gradient step $s_k^t$. If this point is not acceptable, then it builds a model using the information at $x_k$ and $x_k^t$, and computes the next iterate $x_{k+1}=x_k+s_k$. Note that $s_k$ not only has a smaller length than the trial step $s_k^t$, but also lies along a direction that decreases the function value.
  • Figure 2: The coefficients of $g_k$ and $g_k^t$ during a single-epoch run of SMB on the MNIST data with $\alpha=0.5$. Model steps are taken quite often, but not at all iterations. The sum of the two coefficients varies in $[-0.5, -0.25]$.
  • Figure 3: Classification on MNIST with an MLP model.
  • Figure 4: Classification on CIFAR10 (left column) and CIFAR100 (right column) with ResNet-34 model.
  • Figure 5: Classification on CIFAR10 (left column) and CIFAR100 (right column) with DenseNet121 model.
  • ...and 1 more figure
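
The iteration described in the Figure 1 caption can be illustrated with a minimal Python sketch for a single parameter group. Several pieces here are assumptions rather than the paper's definitions: the acceptance test is taken to be an Armijo-style sufficient-decrease check with a hypothetical constant `c`, the objective is deterministic rather than a mini-batch estimate, and the model step is a stand-in that applies one inverse-BFGS update to $\alpha I$ using the pair $(s_k^t,\ g_k^t - g_k)$. The stand-in is not the SMB model step itself, although it shares the properties the caption mentions: it uses information at both $x_k$ and $x_k^t$ and can change the direction as well as the length of the step. The names `model_step` and `smb_like_iteration` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(a, u, v):
    """Return a*u + v elementwise."""
    return [a * ui + vi for ui, vi in zip(u, v)]

def model_step(g, s_t, g_t, alpha):
    """Stand-in model step (NOT the paper's formula): return -H g, where H is
    one inverse-BFGS update of alpha*I built from s_t and y = g_t - g."""
    y = axpy(-1.0, g, g_t)                 # curvature pair: y = g_t - g
    ys = dot(y, s_t)
    if ys <= 1e-12:                        # no usable curvature information
        return [0.5 * si for si in s_t]    # hypothetical fallback: halve the trial step
    rho = 1.0 / ys
    v = axpy(-rho * dot(s_t, g), y, g)     # (I - rho * y s^T) g
    w = [alpha * vi for vi in v]           # apply H0 = alpha * I
    w = axpy(-rho * dot(y, w), s_t, w)     # (I - rho * s y^T) w
    Hg = axpy(rho * dot(s_t, g), s_t, w)   # add rho * s s^T g
    return [-h for h in Hg]

def smb_like_iteration(f, grad, x, alpha=0.5, c=0.1):
    """One iteration: try the gradient step, otherwise take a model step."""
    g = grad(x)
    s_t = [-alpha * gi for gi in g]        # trial step s_k^t
    x_t = axpy(1.0, s_t, x)                # trial point x_k^t
    # Assumed acceptance test: Armijo-style sufficient decrease.
    if f(x_t) <= f(x) - c * alpha * dot(g, g):
        return x_t, s_t, True
    s = model_step(g, s_t, grad(x_t), alpha)
    return axpy(1.0, s, x), s, False

# Simple quadratic f(x) = 0.5 * (x_1^2 + 10 * x_2^2); alpha = 0.5 overshoots along x_2.
A = [1.0, 10.0]
f = lambda x: 0.5 * sum(a * xi * xi for a, xi in zip(A, x))
grad = lambda x: [a * xi for a, xi in zip(A, x)]

x0 = [1.0, 1.0]
x1, s, accepted = smb_like_iteration(f, grad, x0)
trial_norm = 0.5 * math.sqrt(dot(grad(x0), grad(x0)))
print(accepted, f(x0), f(x1), math.sqrt(dot(s, s)), trial_norm)
```

On this quadratic the trial point is rejected, and the stand-in model step is shorter than the trial step while still decreasing the function value, which mirrors the qualitative behaviour in Figure 1. In the setting the abstract describes, the same logic would be applied separately to each parameter group, with `f` and `grad` evaluated on a mini-batch.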