A Recurrent Latent Variable Model for Sequential Data

Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron Courville, Yoshua Bengio

TL;DR

The paper addresses modelling highly structured sequential data by introducing a variational recurrent neural network (VRNN) that embeds a latent random variable at each timestep. Each latent is drawn from a prior conditioned on the previous RNN hidden state; the generating distribution is conditioned on the latent and the previous hidden state, and the approximate posterior is conditioned on the current input and the previous hidden state. The whole sequence model is therefore trained with a timestep-wise variational lower bound. Empirically, VRNNs (with either Gaussian or Gaussian mixture observation models) achieve higher log-likelihoods than the RNN baselines (RNN-Gauss and RNN-GMM) on speech and handwriting tasks, with the hidden-state-conditioned prior improving performance, and generated samples showing less noise and a more consistent handwriting style. This approach offers a principled way to capture multimodal, temporally coherent variability in sequential data, with potential impact on speech synthesis and other structured sequence domains.
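As a rough illustration of the conditioning structure described above, here is a minimal NumPy sketch of one VRNN timestep and the resulting per-sequence lower bound. It is not the authors' implementation: the layer sizes, the random untrained affine maps (`linear`), the plain tanh recurrence, the omission of any feature extractors, and all function names are assumptions made for the example, and the observation model is a diagonal Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
X_DIM, Z_DIM, H_DIM = 3, 2, 8  # hypothetical sizes, chosen only for the sketch


def linear(in_dim, out_dim):
    """Random affine map standing in for a trained network (assumption)."""
    W = rng.normal(scale=0.1, size=(in_dim, out_dim))
    b = np.zeros(out_dim)
    return lambda v: v @ W + b


def softplus(a):
    """Numerically stable log(1 + exp(a)), used to keep std-devs positive."""
    return np.logaddexp(0.0, a)


# One (mean, std-dev) pair of maps per distribution, plus the recurrence.
prior_mu, prior_sd = linear(H_DIM, Z_DIM), linear(H_DIM, Z_DIM)
enc_mu, enc_sd = linear(X_DIM + H_DIM, Z_DIM), linear(X_DIM + H_DIM, Z_DIM)
dec_mu, dec_sd = linear(Z_DIM + H_DIM, X_DIM), linear(Z_DIM + H_DIM, X_DIM)
rec = linear(X_DIM + Z_DIM + H_DIM, H_DIM)


def gaussian_kl(mu_q, sd_q, mu_p, sd_p):
    """KL( N(mu_q, diag(sd_q^2)) || N(mu_p, diag(sd_p^2)) ), summed over dims."""
    return np.sum(
        np.log(sd_p / sd_q)
        + (sd_q**2 + (mu_q - mu_p) ** 2) / (2.0 * sd_p**2)
        - 0.5
    )


def gaussian_log_prob(x, mu, sd):
    """Log-density of x under a diagonal Gaussian, summed over dims."""
    return np.sum(-0.5 * np.log(2.0 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2)


def vrnn_step(x_t, h_prev):
    """One timestep: prior, approximate posterior, generation, state update."""
    # Prior over z_t, conditioned on the previous hidden state.
    mu_p, sd_p = prior_mu(h_prev), softplus(prior_sd(h_prev)) + 1e-6
    # Approximate posterior over z_t, conditioned on x_t and the previous state.
    xh = np.concatenate([x_t, h_prev])
    mu_q, sd_q = enc_mu(xh), softplus(enc_sd(xh)) + 1e-6
    # Reparameterised sample from the approximate posterior.
    z_t = mu_q + sd_q * rng.standard_normal(Z_DIM)
    # Generating distribution over x_t, conditioned on z_t and the previous state.
    zh = np.concatenate([z_t, h_prev])
    mu_x, sd_x = dec_mu(zh), softplus(dec_sd(zh)) + 1e-6
    # Recurrence: the new state sees x_t, z_t and the previous state.
    h_t = np.tanh(rec(np.concatenate([x_t, z_t, h_prev])))
    # Per-timestep lower-bound term: reconstruction minus KL to the prior.
    bound_t = gaussian_log_prob(x_t, mu_x, sd_x) - gaussian_kl(mu_q, sd_q, mu_p, sd_p)
    return h_t, bound_t


def sequence_bound(xs):
    """Sum the per-timestep terms over a sequence of shape (T, X_DIM)."""
    h, total = np.zeros(H_DIM), 0.0
    for x_t in xs:
        h, bound_t = vrnn_step(x_t, h)
        total += bound_t
    return total
```

Calling `sequence_bound(rng.normal(size=(5, X_DIM)))` returns a single-sample estimate of the bound for a random five-step sequence. In a real model the affine maps would be trained networks and the bound would be maximised by gradient ascent; the sketch only shows which quantities each distribution depends on.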

Abstract

In this paper, we explore the inclusion of latent random variables into the dynamic hidden state of a recurrent neural network (RNN) by combining elements of the variational autoencoder. We argue that through the use of high-level latent random variables, the variational RNN (VRNN) can model the kind of variability observed in highly structured sequential data such as natural speech. We empirically evaluate the proposed model against related sequential models on four speech datasets and one handwriting dataset. Our results show the important roles that latent random variables can play in the RNN dynamic hidden state.

Paper Structure

This paper contains 18 sections, 13 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Graphical illustrations of each operation of the VRNN: (a) computing the conditional prior using Eq. \ref{eq:vrnn_prior}; (b) generating function using Eq. \ref{eq:vrnn_gen}; (c) updating the RNN hidden state using Eq. \ref{eq:vrnn_trans}; (d) inference of the approximate posterior using Eq. \ref{eq:vrnn_posterior}; (e) overall computational paths of the VRNN.
  • Figure 2: The top row represents the difference $\delta_t$ between $\boldsymbol{\mu}_{z,t}$ and $\boldsymbol{\mu}_{z,t-1}$. The middle row shows the dominant $\mathrm{KL}$ divergence values in temporal order. The bottom row shows the input waveforms.
  • Figure 3: Examples from the training set and generated samples from RNN-GMM and VRNN-Gauss. The top three rows show the global waveforms, while the bottom three rows show zoomed-in waveforms. Samples from (b) RNN-GMM contain high-frequency noise, and samples from (c) VRNN-Gauss have less noise. We exclude RNN-Gauss because its samples are close to pure noise.
  • Figure 4: Handwriting samples: (a) training examples and unconditionally generated handwriting from (b) RNN-Gauss, (c) RNN-GMM and (d) VRNN-GMM. The VRNN-GMM retains the writing style from beginning to end while RNN-Gauss and RNN-GMM tend to change the writing style during the generation process. This is possibly because the sequential latent random variables can guide the model to generate each sample with a consistent writing style.