Batch Normalized Recurrent Neural Networks

César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio

TL;DR

The paper investigates whether batch normalization (BN) can reduce the training time of recurrent neural networks. It finds that applying BN to hidden-to-hidden transitions is ineffective, while applying BN to input-to-hidden transitions can speed up learning but may worsen generalization; frame-wise and sequence-wise normalization are compared in this setting. Across speech recognition and language modeling tasks, BN variants accelerate training but tend to increase overfitting, with only limited improvements in final performance. The results highlight how hard it is to transfer batch normalization from feedforward networks to RNNs, and they point to whitening and alternative normalization strategies as directions for future work.

Abstract

Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
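
To make the input-to-hidden variant concrete, here is a minimal pure-Python sketch, not the authors' implementation. It assumes a vanilla tanh RNN of the form h_t = tanh(W_h h_(t-1) + BN(W_x x_t)) and leaves out the learned scale and shift that batch normalization normally carries. The names standardize, matvec, bn_rnn_forward and mode are hypothetical, chosen only for illustration. The "frame" mode computes statistics separately at each time step, while the "sequence" mode pools them over both the batch and time axes.

```python
import math


def standardize(rows, eps=1e-5):
    """Standardize each feature (column) with the mean/variance of `rows`."""
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    variances = [sum((r[j] - means[j]) ** 2 for r in rows) / n
                 for j in range(dim)]
    return [[(r[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(dim)] for r in rows]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def bn_rnn_forward(xs, W_x, W_h, mode="sequence"):
    """Run a tanh RNN with normalization on the input-to-hidden term only.

    xs[t][b] is the input vector at time step t for batch element b.
    Returns hs with hs[t][b] the hidden state at step t for element b.
    """
    T, B, H = len(xs), len(xs[0]), len(W_h)
    # Input-to-hidden projections W_x x_t for every time step and element.
    proj = [[matvec(W_x, xs[t][b]) for b in range(B)] for t in range(T)]
    if mode == "frame":
        # Frame-wise: separate mini-batch statistics for each time step.
        normed = [standardize(proj[t]) for t in range(T)]
    else:
        # Sequence-wise: one set of statistics over batch and time.
        flat = standardize([p for step in proj for p in step])
        normed = [flat[t * B:(t + 1) * B] for t in range(T)]
    h = [[0.0] * H for _ in range(B)]  # initial hidden state
    hs = []
    for t in range(T):
        # The recurrent term W_h h_{t-1} is left unnormalized.
        h = [[math.tanh(a + r)
              for a, r in zip(normed[t][b], matvec(W_h, h[b]))]
             for b in range(B)]
        hs.append(h)
    return hs
```

Calling bn_rnn_forward(xs, W_x, W_h, mode="frame") and then again with mode="sequence" on the same inputs generally yields different hidden states, because the two modes standardize the projections with different statistics.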

Paper Structure

This paper contains 11 sections, 13 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Frame-wise cross entropy on WSJ for the baseline (blue) and batch normalized (red) networks. The dotted lines are the training curves and the solid lines are the validation curves.
  • Figure 2: Large LSTM on Penn Treebank for the baseline (blue) and the batch normalized (red) networks. The dotted lines are the training curves and the solid lines are the validation curves.
  • Figure 3: Typical training curves obtained during the grid search. The baseline network is in blue and the batch normalized one is in red. For this experiment, the hyper-parameters are: learning rate 7.8e-4, momentum 0.5, batch size 64.