Variational Recurrent Auto-Encoders

Otto Fabius, Joost R. van Amersfoort

TL;DR

This work introduces the Variational Recurrent Auto-Encoder (VRAE), which fuses RNNs with Stochastic Gradient Variational Bayes to learn latent representations of time-series data in an unsupervised, scalable manner and to generate sequence data. The encoder maps sequences to a Gaussian latent distribution, while the decoder reconstructs data from samples drawn via a reparameterized latent variable, enabling efficient gradient-based training. Experiments on MIDI-like data demonstrate both low-dimensional and high-dimensional latent spaces, revealing structured latent organization and the ability to interpolate and generate longer sequences. The authors also argue that VRAE provides useful initializations for standard RNNs, potentially improving training stability and performance for supervised tasks on sequential data.
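
The reparameterized sampling step mentioned above can be illustrated with a minimal sketch. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance, and a standard normal prior, as is common in SGVB-style models; the function names `reparameterize` and `kl_to_standard_normal` and the example values are hypothetical and are not taken from the paper. The RNN encoder and decoder themselves are omitted.

```python
import math
import random


def reparameterize(mu, log_sigma_sq, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The noise eps does not depend on the parameters, so z is a
    deterministic, differentiable function of mu and log_sigma_sq.
    """
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_sigma_sq)
    ]


def kl_to_standard_normal(mu, log_sigma_sq):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)).

    Assumption: the prior over z is a standard normal.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_sigma_sq)
    )


# Hypothetical encoder outputs for a two-dimensional latent space.
rng = random.Random(0)
mu = [0.5, -1.0]
log_sigma_sq = [0.0, -2.0]

z = reparameterize(mu, log_sigma_sq, rng)      # sample passed to the decoder
kl = kl_to_standard_normal(mu, log_sigma_sq)   # regularization term of the bound
```

In a full model, `mu` and `log_sigma_sq` would be computed from the encoder's final hidden state, and `z` would be used to initialize the decoder. The KL term would be combined with the reconstruction log-likelihood to form the lower bound that is optimized.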

Abstract

In this paper we propose a model that combines the strengths of RNNs and SGVB: the Variational Recurrent Auto-Encoder (VRAE). Such a model can be used for efficient, large scale unsupervised learning on time series data, mapping the time series data to a latent vector representation. The model is generative, such that data can be generated from samples of the latent space. An important contribution of this work is that the model can make use of unlabeled data in order to facilitate supervised training of RNNs by initialising the weights and network state.

Paper Structure

This paper contains 9 sections, 5 equations, 2 figures.

Figures (2)

  • Figure 1: On the left is the lower bound of the log-likelihood per datapoint per time step during training. The first 10 epochs were cut off for scale reasons. On the right is the organisation of all data points in latent space. Each datapoint is encoded, and visualized at the location of the resulting two-dimensional mean $\mu$ of the encoding. "Mario Underworld" (green triangles), "Mario" (red triangles) and "Mariokart" (blue triangles) occupy the most distinct regions.
  • Figure 2: On the left is the lower bound of the log-likelihood per datapoint per time step during training. The first 10 epochs were cut off for scale reasons. On the right is a visualization of the organisation of the encoded data in latent space. The 20-dimensional latent representation is calculated for each data point, and the mean $\mu$ of this representation is visualized in two dimensions using t-SNE. Each color represents the data points from one song. It can be seen that, for each song, the parts of that song occupy only a part of the space, and the parts of some songs (e.g. "mariounderworld", in purple) are clearly grouped together. How well the parts of one song can be grouped together depends on the homogeneity of the song relative to the similarity between the different songs, as well as on how much spatial information is lost during the dimensionality reduction of t-SNE.