Benchmarking Multimodal Variational Autoencoders: CdSprites+ Dataset and Toolkit

Gabriela Sejnova, Michal Vavrecka, Karla Stepanova, Tadahiro Taniguchi

TL;DR

We propose a toolkit for systematic multimodal VAE training and comparison, and present a disentangled bimodal dataset designed to comprehensively evaluate joint generation and cross-generation capabilities across multiple difficulty levels.

Abstract

Multimodal Variational Autoencoders (VAEs) have been the subject of intense research in recent years because they can integrate multiple modalities into a joint representation and can thus serve as a promising tool for both data classification and generation. Several approaches to multimodal VAE learning have been proposed so far; however, their comparison and evaluation have been rather inconsistent. One reason is that the models differ at the implementation level; another is that the datasets commonly used in these cases were not originally designed to evaluate multimodal generative models. This paper addresses both issues. First, we propose a toolkit for systematic multimodal VAE training and comparison. The toolkit currently comprises 4 existing multimodal VAEs and 6 commonly used benchmark datasets, along with instructions on how to add a new model or dataset. Second, we present a disentangled bimodal dataset designed to comprehensively evaluate joint generation and cross-generation capabilities across multiple difficulty levels. We demonstrate the utility of our dataset by comparing the implemented state-of-the-art models.

Paper Structure (29 sections, 9 figures, 11 tables)

Figures (9)

  • Figure 1: Examples of our proposed CdSprites+ dataset. The dataset contains RGB images (left columns) and their textual descriptions (right columns). We provide 5 levels of difficulty (left to right). Level 1 varies only the shape attribute, Level 2 varies shape and size, Level 3 also varies the colour attribute, Level 4 also varies the position, and Level 5 also varies the background shade. See Section \ref{datasetdesc} for a more detailed description of the dataset.
  • Figure 2: Qualitative results of the MVAE, MMVAE, MoPoE and DMVAE models trained on Levels 1, 3 and 5 of our CdSprites+ dataset. We first show the reconstructions of the input image, then the captions obtained by cross-sampling.
  • Figure 3: Results for the MVAE and MMVAE models trained on the MNIST-SVHN dataset using our toolkit. For MMVAE, we used the DREG objective as proposed by the authors; MVAE was trained with the ELBO. We used the encoder and decoder networks from the original implementations. The top figures are traversals for each modality; below them we show cross-generated samples. The bottom figures are t-SNE visualizations of the latent space. Note that for MVAE we show samples from the single joint posterior, while for MMVAE we show samples for both modality-specific distributions.
  • Figure 4: t-SNE visualizations for the MVAE model's (16-D) joint latent space trained on CdSprites+ Level 4. We show the latent space for each of the 4 features (size, shape, position and colour) individually.
  • Figure 5: t-SNE visualizations for the MMVAE model's (24-D) unimodal latent spaces trained on CdSprites+ Level 4. We show the latent space for each of the 4 features (size, shape, position and colour) individually.
  • ...and 4 more figures
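
The difficulty levels described in the Figure 1 caption can be summarised as a small lookup. The following is a minimal sketch, assuming the levels are cumulative (each level varies every attribute of the previous level plus one new attribute), which is consistent with the four features listed for Level 4 in the Figure 4 and Figure 5 captions. The names `ATTRIBUTE_ADDED_AT_LEVEL` and `varied_attributes` are hypothetical and are not part of the toolkit.

```python
from typing import List

# Hypothetical mapping (not the toolkit's API): the attribute that each
# CdSprites+ difficulty level newly varies, per the Figure 1 caption.
ATTRIBUTE_ADDED_AT_LEVEL = {
    1: "shape",
    2: "size",
    3: "colour",
    4: "position",
    5: "background",
}


def varied_attributes(level: int) -> List[str]:
    """Return the attributes that vary at a given level.

    Assumes the levels are cumulative, i.e. level k varies everything
    level k-1 varies plus one additional attribute.
    """
    if level not in ATTRIBUTE_ADDED_AT_LEVEL:
        raise ValueError(f"level must be in 1..5, got {level}")
    return [ATTRIBUTE_ADDED_AT_LEVEL[k] for k in range(1, level + 1)]


if __name__ == "__main__":
    for level in range(1, 6):
        print(f"Level {level}: {', '.join(varied_attributes(level))}")
```

Under this reading, Level 4 varies shape, size, colour and position, matching the four features visualized for the Level 4 latent spaces.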