Reinforcement Learning for Learning Rate Control

Chang Xu, Tao Qin, Gang Wang, Tie-Yan Liu

TL;DR

This paper addresses the challenge of selecting effective learning rates for SGD by casting learning-rate control as a sequential decision problem tackled with an actor-critic reinforcement learning framework. An actor outputs continuous learning-rate actions based on a compact state representation, while a critic estimates long-term performance via TD learning; gradient disagreement is leveraged to stabilize training. Empirical results on MNIST and CIFAR-10 show the approach achieves superior final convergence and smoother training compared with traditional optimizers and prior RL-based methods. The work suggests a promising direction for automated hyperparameter control and highlights future opportunities for per-parameter rates and additional hyperparameters.

Abstract

Stochastic gradient descent (SGD), which updates the model parameters by subtracting a local gradient times a learning rate at each step, is widely used to train machine learning models such as neural networks. Models trained by SGD are observed to be sensitive to the learning rate, and good learning rates are problem specific. We propose an algorithm that automatically learns learning rates using neural-network-based actor-critic methods from deep reinforcement learning (RL). In particular, we train a policy network, called the actor, to decide the learning rate at each step during training, and a value network, called the critic, to give feedback on the quality of the actor's decision (i.e., how good the learning rate output by the actor is). The auxiliary actor and critic networks help the main network achieve better performance. Experiments on different datasets and network architectures show that our approach leads to better convergence of SGD than human-designed competitors.
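To make the actor-critic loop described in the abstract concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation. It assumes a 1-D least-squares task, linear actor and critic models in place of neural networks, a state made of the current mini-batch loss only, and a reward equal to the drop in loss between consecutive steps. The names `run_controller`, `act`, `q_value`, and all step sizes are hypothetical choices for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def run_controller(steps=200, lr_max=0.5, gamma=0.9, seed=0):
    """Toy actor-critic learning-rate control for 1-D least squares (illustrative only)."""
    rng = random.Random(seed)
    # Hypothetical data set: y = 3x + Gaussian noise.
    data = []
    for _ in range(256):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

    def loss_and_grad(w, batch):
        n = len(batch)
        loss = sum(0.5 * (w * x - y) ** 2 for x, y in batch) / n
        grad = sum((w * x - y) * x for x, y in batch) / n
        return loss, grad

    def act(actor, loss):
        # Actor: maps the state (here just the loss) to a learning rate in (0, lr_max).
        return lr_max * sigmoid(actor[0] * loss + actor[1])

    def q_value(critic, loss, lr):
        # Critic: linear estimate of long-term return for a (state, action) pair.
        return critic[0] * loss + critic[1] * lr + critic[2]

    w = 0.0                     # parameter of the main model
    actor = [0.0, 0.0]          # weights on [loss, bias]
    critic = [0.0, 0.0, 0.0]    # weights on [loss, learning rate, bias]
    actor_step, critic_step = 0.1, 0.05   # assumed step sizes
    lrs, losses = [], []

    loss, grad = loss_and_grad(w, rng.sample(data, 32))
    for _ in range(steps):
        lr = act(actor, loss)
        lrs.append(lr)
        losses.append(loss)

        # SGD step on the main model using the actor's learning rate.
        w -= lr * grad
        next_loss, next_grad = loss_and_grad(w, rng.sample(data, 32))
        reward = loss - next_loss   # assumed reward: decrease in mini-batch loss

        # Critic update by TD(0) with semi-gradient on the linear features.
        next_lr = act(actor, next_loss)
        td_error = (reward + gamma * q_value(critic, next_loss, next_lr)
                    - q_value(critic, loss, lr))
        features = (loss, lr, 1.0)
        critic = [c + critic_step * td_error * f for c, f in zip(critic, features)]

        # Actor update by the deterministic policy gradient: dQ/da * da/dtheta.
        dq_da = critic[1]
        da_dz = lr * (1.0 - lr / lr_max)   # derivative of lr_max * sigmoid(z)
        actor[0] += actor_step * dq_da * da_dz * loss
        actor[1] += actor_step * dq_da * da_dz

        loss, grad = next_loss, next_grad
    return w, lrs, losses
```

Calling `run_controller()` returns the final parameter, the learning rates the actor chose, and the mini-batch losses seen at each step. The sigmoid squashing is one simple way to keep the continuous action inside a safe range; the paper's actual state features, network architectures, and reward definition may differ from these assumptions.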

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 1 algorithm.

Figures (5)

  • Figure 1: The framework of our proposed automatic learning rate controller.
  • Figure 2: Results on MNIST. (a) Training loss. (b) Test loss. The x-axis represents the number of mini-batches. The y-axis represents the loss value.
  • Figure 3: Results on CIFAR-10. (a) Training loss. (b) Test loss. The x-axis represents the number of mini-batches. The y-axis represents the loss value.
  • Figure 4: Trajectories produced by different algorithms on three random two-dimensional regression problems. The axes represent the values of the two dimensions. The contours outline the area with the same target value, and the target value is gradually decreasing from orange area to blue area. Each arrow represents one iteration of an algorithm, whose tail and tip correspond to the preceding and subsequent iterations respectively.
  • Figure 5: Gradient disagreement, training loss and test loss of SGD and our method on a two-dimensional regression problem.