Unitary Evolution Recurrent Neural Networks

Martin Arjovsky, Amar Shah, Yoshua Bengio

TL;DR

The paper tackles vanishing and exploding gradients in RNNs by introducing Unitary Evolution RNNs (uRNNs), whose hidden-to-hidden matrix is kept unitary, and therefore norm-preserving, through an efficient parameterization composed of simple structured blocks. The hidden state lives in the complex domain, but the model is implemented with real-valued computations, and the approach supports very large hidden states at manageable cost. The authors show that uRNNs perform strongly on tasks requiring long-term memory, often surpassing LSTMs and other orthogonally initialized models, and they offer insights into gradient propagation and hidden-state saturation. These results suggest that unitary, norm-preserving recurrent architectures are a scalable way to model long-range dependencies in sequential data.

Abstract

Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.
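
The idea of composing structured unitary blocks can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the choice of blocks (diagonal phase matrices, Householder reflections, a fixed permutation, and the unitary discrete Fourier transform), the order in which `apply_w` composes them, and every function name are assumptions of this sketch. It also uses Python's built-in complex type and a naive $O(n^2)$ transform for readability, where a practical version would use real-valued arithmetic and an FFT.

```python
import cmath
import math
import random


def diag_phase(theta, h):
    """Apply D = diag(exp(i * theta_j)); every diagonal entry has modulus 1."""
    return [cmath.exp(1j * t) * x for t, x in zip(theta, h)]


def reflect(v, h):
    """Apply the Householder reflection R = I - 2 v v^* / ||v||^2."""
    vv = sum(abs(x) ** 2 for x in v)
    vh = sum(x.conjugate() * y for x, y in zip(v, h))  # inner product v^* h
    return [y - 2 * x * vh / vv for x, y in zip(v, h)]


def permute(perm, h):
    """Apply a fixed permutation of the coordinates."""
    return [h[p] for p in perm]


def dft(h, inverse=False):
    """Unitary (1/sqrt(n)-normalized) DFT, written naively in O(n^2)."""
    n = len(h)
    sign = 1 if inverse else -1
    return [
        sum(h[k] * cmath.exp(sign * 2j * math.pi * j * k / n) for k in range(n))
        / math.sqrt(n)
        for j in range(n)
    ]


def norm(h):
    return math.sqrt(sum(abs(x) ** 2 for x in h))


def random_params(n, seed=0):
    """Hypothetical parameter container: three phase vectors, two reflection
    vectors, and one permutation."""
    rng = random.Random(seed)
    phases = lambda: [rng.uniform(-math.pi, math.pi) for _ in range(n)]
    vector = lambda: [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    perm = list(range(n))
    rng.shuffle(perm)
    return {"theta": [phases(), phases(), phases()],
            "v": [vector(), vector()],
            "perm": perm}


def apply_w(p, h):
    """Compute W h, where W is a product of unitary blocks (order assumed)."""
    h = diag_phase(p["theta"][0], h)
    h = reflect(p["v"][0], dft(h))
    h = diag_phase(p["theta"][1], permute(p["perm"], h))
    h = reflect(p["v"][1], dft(h, inverse=True))
    return diag_phase(p["theta"][2], h)


def max_norm_drift(n=8, steps=200, seed=0):
    """Largest deviation of ||h|| from its starting value over repeated steps."""
    p = random_params(n, seed)
    rng = random.Random(seed + 1)
    h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    start, drift = norm(h), 0.0
    for _ in range(steps):
        h = apply_w(p, h)
        drift = max(drift, abs(norm(h) - start))
    return drift
```

Because each block is unitary, so is their product, and `max_norm_drift` should report only floating-point rounding error however many steps are taken. Each block other than the transform costs $O(n)$ to apply and stores $O(n)$ parameters, which is what makes this kind of parameterization cheap compared with a dense $n \times n$ matrix.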

Paper Structure

This paper contains 13 sections, 1 theorem, 9 equations, 4 figures.

Key Result

Lemma 1

A complex square matrix $\mathbf{W}$ is unitary if and only if it has an eigendecomposition of the form $\mathbf{W} = \mathbf{V} \mathbf{D} \mathbf{V}^*$, where $^*$ denotes the conjugate transpose. Here, $\mathbf{V}, \mathbf{D} \in \mathbb{C}^{n \times n}$ are complex matrices, where $\mathbf{V}$ i

Figures (4)

  • Figure 1: Results of the copying memory problem for time lags of $100, 200, 300, 500$. The LSTM is able to beat the baseline only for $100$ time steps. Conversely, the uRNN is able to completely solve each time length in very few training iterations, without getting stuck at the baseline.
  • Figure 2: Results of the adding problem for $T=100, 200, 400, 750$. The RNN with tanh is not able to beat the baseline for any time length. The LSTM and the uRNN show similar performance across time lengths, consistently beating the baseline.
  • Figure 3: Results on pixel by pixel MNIST classification tasks. The uRNN is able to converge in a fraction of the iterations that the LSTM requires. The LSTM performs better on MNIST classification, but the uRNN outperforms on the more complicated task of permuted pixels.
  • Figure 4: From left to right: norms of the gradients with respect to hidden states, i.e., $\left\lVert\frac{\partial C}{\partial h_t}\right\rVert$, at (i) the beginning of training and (ii) after 100 iterations; (iii) norms of the hidden states; and (iv) $L_2$ distance between hidden states and the final hidden state. The gradient norms of uRNNs do not decay as fast as those of other models as training progresses. uRNN hidden state norms stay much more consistent over time than those of the LSTM. LSTM hidden states stay almost the same after a number of time steps, suggesting that it is not able to use new input information.
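
The gradient-norm behaviour described for Figure 4 can be illustrated with a toy calculation. The sketch below is not the paper's experiment: it assumes a purely linear recurrence $h_{t+1} = \mathbf{W} h_t$ with a hypothetical $2 \times 2$ weight matrix (a rotation scaled by `scale`, so both eigenvalues have modulus `scale`), and it ignores inputs and nonlinearities. Under those assumptions the backpropagated gradient is $\frac{\partial C}{\partial h_t} = \mathbf{W}^\top \frac{\partial C}{\partial h_{t+1}}$, and its norm can be tracked step by step.

```python
import math


def scaled_rotation(angle, scale=1.0):
    """2x2 rotation multiplied by `scale`; both eigenvalues have modulus `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * c, -scale * s], [scale * s, scale * c]]


def backprop_norms(w, grad, steps):
    """Norm of the gradient after each application of W^T (linear RNN, no inputs)."""
    norms = []
    for _ in range(steps):
        grad = [
            w[0][0] * grad[0] + w[1][0] * grad[1],  # first row of W^T times grad
            w[0][1] * grad[0] + w[1][1] * grad[1],  # second row of W^T times grad
        ]
        norms.append(math.hypot(grad[0], grad[1]))
    return norms


if __name__ == "__main__":
    for scale in (0.9, 1.0, 1.1):
        final = backprop_norms(scaled_rotation(0.3, scale), [1.0, 0.0], 100)[-1]
        print(f"eigenvalue modulus {scale}: gradient norm after 100 steps = {final:.3e}")
```

In this toy setting the norm after $T$ steps equals `scale` raised to the power $T$ times the initial norm, so it shrinks geometrically for moduli below 1, grows geometrically for moduli above 1, and stays constant only when the modulus is exactly 1.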

Theorems & Definitions (1)

  • Lemma 1