Variable Computation in Recurrent Neural Networks

Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov

TL;DR

This work introduces Variable Computation Units (VCUs) for recurrent networks to dynamically adjust computation at each time step based on the current state and input. By pairing a scheduler with a partial update mechanism, the authors instantiate two models, the VCRNN and VCGRU, that often achieve better predictive performance with fewer operations than fixed-computation counterparts. Across music and language modeling tasks, VCUs learn interpretable time-scale patterns, such as allocating more effort to fast-changing segments and less to uninformative regions, leading to efficiency and accuracy gains. The results suggest adaptive computation as a practical path toward more resource-efficient sequence models and motivate extensions to other recurrent architectures and training signals.

Abstract

Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long-range dependencies or localized attention phenomena. However, while many types of sequential data (such as video, speech or language) can have highly variable information flow, most recurrent models still consume input features at a constant rate and perform a constant number of computations per time step, which can be detrimental to both speed and model capacity. In this paper, we explore a modification to existing recurrent units which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure. We show experimentally that our models not only require fewer operations but also achieve better overall performance on evaluation tasks.

Paper Structure

This paper contains 22 sections, 17 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Two time steps of a VCU. At each step $t$, the scheduler takes in the current hidden vector $h_{t-1}$ and input vector $x_t$ and decides on a number of dimensions to use $d$. The unit then uses the first $d$ dimensions of $h_{t-1}$ and $x_t$ to compute the first $d$ elements of the new hidden state $h_t$, and carries the remaining $D-d$ dimensions over from $h_{t-1}$.
  • Figure 2: Top: Per-bit computation by VCRNN, higher dimensions (950 to 1000). Middle: adding 8 bits of buffer between every character. Bottom: adding 24 bits of buffer between each character.
  • Figure 3: Per-character computation by VCRNN. Top: English. Middle: Czech. Bottom: German. All languages learn to make use of word units.
  • Figure 4: Bits per character for different computational loads on the Europarl Czech (left) and German (right) datasets. The VCRNN, whether guided to use boundaries or fully unsupervised, achieves better held-out log-likelihood more efficiently than the standard RNN.
  • Figure 5: Per-character computation by VCRNN. The model appears to make use of morphology, separating sub-word units.
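
The partial update described in the Figure 1 caption can be illustrated with a short NumPy sketch. It is not the authors' implementation: the function names, the tanh nonlinearity, the assumption that $x_t$ has the same dimension $D$ as the hidden state, and the sigmoid-based scheduler are all hypothetical choices made for illustration. Only the slicing logic (recompute the first $d$ dimensions, carry over the remaining $D-d$) comes from the caption.

```python
import numpy as np

def vcu_step(h_prev, x_t, d, W_h, W_x, b):
    """One partial update: recompute the first d hidden dimensions only.

    Assumes (for illustration) that h_prev and x_t both have length D,
    W_h and W_x are D x D, and b has length D.
    """
    h_new = h_prev.copy()  # the last D - d dimensions are carried over unchanged
    # Only the top-left d x d blocks of the weights are touched,
    # so the cost is on the order of d^2 rather than D^2 multiply-adds.
    pre = W_h[:d, :d] @ h_prev[:d] + W_x[:d, :d] @ x_t[:d] + b[:d]
    h_new[:d] = np.tanh(pre)
    return h_new

def schedule(h_prev, x_t, u, v, c):
    """Hypothetical scheduler: a sigmoid score in (0, 1) scaled to a dimension count."""
    D = h_prev.shape[0]
    m = 1.0 / (1.0 + np.exp(-(u @ h_prev + v @ x_t + c)))
    return int(np.ceil(m * D))
```

In this sketch a step with a small $d$ leaves most of the state untouched, which is one way to read the time-scale patterns shown in Figures 2, 3 and 5: low-computation steps mostly preserve the existing state, while high-computation steps rewrite more of it.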