Learning to Optimize Neural Nets

Ke Li, Jitendra Malik

TL;DR

The paper tackles learning optimization algorithms that can train neural nets, a high-dimensional and stochastic optimization problem, without hand-designed update rules. It extends the Learning to Optimize framework using Guided Policy Search (GPS), with a convolutional GPS variant that exploits permutation structure in neural nets. Meta-trained on the problem of training shallow nets on MNIST, the learned optimizer (Predicted Step Descent) generalizes to the Toronto Faces Dataset, CIFAR-10 and CIFAR-100 and remains robust to changes in gradient noise and network architecture. The reported results show the learned optimizer outperforming standard hand-engineered methods and previously learned optimizers, which suggests practical potential for automated optimizer design.

Abstract

Learning to Optimize is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.
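
To make the framing concrete, the following is a minimal sketch of the interface the abstract implies: an optimization algorithm viewed as a policy that maps features of the current optimization state to a step. The names `run_optimizer` and `gradient_descent_policy`, the choice of state features, and the quadratic test objective are all assumptions made for illustration. The policy shown is a hand-written placeholder (plain gradient descent), not the learned policy from the paper, which is trained with reinforcement learning.

```python
import numpy as np

def run_optimizer(policy, objective, gradient, x0, num_steps):
    """Run an optimizer expressed as a policy: state -> step vector."""
    x = np.asarray(x0, dtype=float)
    values = []
    for _ in range(num_steps):
        # Hypothetical state features: current iterate, objective value, gradient.
        state = {"x": x, "f": objective(x), "grad": gradient(x)}
        values.append(state["f"])
        # The policy decides the update; a learned optimizer would replace it.
        x = x + policy(state)
    values.append(objective(x))
    return x, values

def gradient_descent_policy(state, step_size=0.1):
    """Placeholder hand-engineered policy: a fixed-step gradient descent update."""
    return -step_size * state["grad"]

# Toy problem (an assumption, not from the paper): f(x) = 0.5 * ||x||^2.
def objective(x):
    return 0.5 * float(x @ x)

def gradient(x):
    return x

x_final, values = run_optimizer(
    gradient_descent_policy, objective, gradient, np.ones(3), num_steps=50
)
```

Under this view, swapping `gradient_descent_policy` for a trained model changes the optimizer without touching the loop, which is what makes the update rule itself something that can be learned.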

Paper Structure

This paper contains 14 sections, 7 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.
  • Figure 2: Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.
  • Figure 3: Comparison of the various hand-engineered and learned algorithms on training neural nets with 48 input and hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.
  • Figure 4: Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 with mini-batches of size 10. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.
  • Figure 5: Comparison of the various hand-engineered and learned algorithms on training neural nets with 100 input units and 200 hidden units on (a) TFD, (b) CIFAR-10 and (c) CIFAR-100 for 800 iterations with mini-batches of size 64. The vertical axis is the true objective value and the horizontal axis represents the iteration. Best viewed in colour.