Advances in Optimizing Recurrent Networks

Yoshua Bengio, Nicolas Boulanger-Lewandowski, Razvan Pascanu

TL;DR

The paper addresses the optimization difficulties of training recurrent networks, especially for long-range dependencies, and analyzes why standard SGD struggles with vanishing/exploding gradients. It systematically investigates a suite of techniques—gradient clipping, leaky integration, powerful output models (RBM/NADE), sparse gradients with rectified units, and a simplified Nesterov momentum—to improve both training dynamics and generalization. Through extensive experiments on polyphonic music and Penn Treebank text data, these methods yield consistent improvements in log-likelihood, perplexity, and entropy, at times matching Hessian-free optimization. The findings underscore the importance of gradient dynamics and temporal credit assignment in RNNs and demonstrate that enhanced SGD can rival more complex second-order methods while remaining online-friendly.

Abstract

After a more than decade-long period of relatively little research activity in the area of recurrent neural networks, several new developments will be reviewed here that have allowed substantial progress both in understanding and in technical solutions towards more efficient training of recurrent networks. These advances have been motivated by and related to the optimization issues surrounding deep learning. Although recurrent networks are extremely powerful in what they can in principle represent in terms of modelling sequences, their training is plagued by two aspects of the same issue regarding the learning of long-term dependencies. Experiments reported here evaluate the use of clipping gradients, spanning longer time ranges with leaky integration, advanced momentum techniques, using more powerful output probability models, and encouraging sparser gradients to help symmetry breaking and credit assignment. The experiments are performed on text and music data and show off the combined effects of these techniques in generally improving both training and test error.
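
As a rough illustration of the first technique listed above, gradient clipping can be sketched as rescaling the gradient whenever its norm exceeds a threshold. This is a minimal sketch that assumes norm-based clipping over all parameters jointly; the function name `clip_gradient_norm`, the threshold value, and the toy gradients are hypothetical and not taken from the paper.

```python
import numpy as np

def clip_gradient_norm(grads, threshold):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed `threshold` (illustrative helper, not from the paper)."""
    # Joint L2 norm over every parameter's gradient.
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        # Shrink all gradients by the same factor, preserving direction.
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy gradients with joint norm sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped, norm_before = clip_gradient_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

Because every gradient is scaled by the same factor, the update direction is unchanged and only its magnitude is bounded, which is what limits the damage from an occasional exploding gradient.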

Paper Structure

This paper contains 14 sections, 4 equations, 2 tables.