A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Samir Sadok; Simon Leglaive; Laurent Girin; Xavier Alameda-Pineda; Renaud Séguier

TL;DR

The paper tackles audiovisual speech representation learning with limited labeled data by proposing MDVAE, a hierarchical, multimodal, and dynamical variational autoencoder. It separates static information (e.g., speaker identity and global emotion) from dynamic content (shared audiovisual dynamics and modality-specific dynamics) and uses a two-stage training procedure, with VQ-VAE pretraining followed by MDVAE training, to improve reconstruction and latent disentanglement. Experiments on MEAD show that the static latent $w$ supports strong emotion recognition with few labels, while qualitative and quantitative analyses indicate what each latent variable encodes: $z^{(av)}$ captures lip movements and formants, $z^{(a)}$ and $z^{(v)}$ encode modality-specific dynamics, and $w$ encodes identity and global emotion. The approach yields robust audiovisual fusion, effective denoising of facial images by leveraging the audio channel, and competitive or superior emotion recognition performance compared with supervised baselines, highlighting its potential for unsupervised multimodal synthesis and analysis in speech processing.

Abstract

In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audiovisual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
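The latent structure described above can be made concrete with a joint distribution. The following is only a sketch of one factorization consistent with that description, not necessarily the exact equation used in the paper. The symbols $x^{(a)}_t$ and $x^{(v)}_t$ (audio and visual observations at frame $t$), the sequence length $T$, and the autoregressive form of the priors on the dynamical latents are assumptions introduced here for illustration.

$$
p\big(x^{(a)}_{1:T}, x^{(v)}_{1:T}, w, z^{(av)}_{1:T}, z^{(a)}_{1:T}, z^{(v)}_{1:T}\big)
= p(w) \prod_{t=1}^{T}
p\big(z^{(av)}_t \mid z^{(av)}_{1:t-1}\big)\,
p\big(z^{(a)}_t \mid z^{(a)}_{1:t-1}\big)\,
p\big(z^{(v)}_t \mid z^{(v)}_{1:t-1}\big)\,
p\big(x^{(a)}_t \mid w, z^{(av)}_t, z^{(a)}_t\big)\,
p\big(x^{(v)}_t \mid w, z^{(av)}_t, z^{(v)}_t\big)
$$

Under this reading, $w$ has no time index and conditions every frame of both modalities, $z^{(av)}_t$ conditions both modalities, and $z^{(a)}_t$ and $z^{(v)}_t$ each condition only their own modality.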

Paper Structure (21 sections, 9 equations, 16 figures, 7 tables)

Introduction and related work
Multimodal Dynamical VAE
Motivation and notations
Generative model
Inference model
Training
Two-stage training
Experiments on audiovisual speech
Expressive audiovisual speech dataset
Training VQ-MDVAE
Analysis-resynthesis
Analysis-transformation-synthesis
Qualitative results
Quantitative results
Audiovisual facial image denoising
...and 6 more sections

Figures (16)

Figure 1: MDVAE generative probabilistic graphical model.
Figure 2: MDVAE inference probabilistic graphical model.
Figure 2: Speech performance of the MDVAE model tested in the analysis-resynthesis experiment. The STOI, PESQ, and MOSnet scores are averaged over the test subset of the MEAD dataset.
Figure 3: The overall architecture of VQ-MDVAE. During the first step of the training process, we learn a VQ-VAE independently on each modality, without any temporal modeling. During the second step of the training process, we learn the MDVAE model on the latent representation provided by the frozen VQ-VAE encoders, before quantization.
Figure 4: Visual sequences generated using the analysis-transformation-synthesis experiment. The top two sequences depict original image sequences of two distinct individuals, while the bottom two sequences were generated by swapping the latent variable $\mathbf{w}$ between the two original sequences.
...and 11 more figures
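
The transformation step described in the Figure 4 caption, swapping the static latent $\mathbf{w}$ between two sequences, can be sketched as follows. This is a minimal illustration, not the authors' code. The names `SequenceLatents` and `swap_static` are hypothetical, latents are stored as plain Python lists, and the encoders and decoders that would produce and consume these latents are not modeled.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

Vector = List[float]


@dataclass(frozen=True)
class SequenceLatents:
    """Hypothetical container for the latents of one audiovisual sequence."""
    w: Vector            # static latent: one vector per sequence
    z_av: List[Vector]   # shared audiovisual dynamics: one vector per frame
    z_a: List[Vector]    # audio-specific dynamics: one vector per frame
    z_v: List[Vector]    # visual-specific dynamics: one vector per frame


def swap_static(
    first: SequenceLatents, second: SequenceLatents
) -> Tuple[SequenceLatents, SequenceLatents]:
    """Exchange only w between two sequences; dynamical latents are untouched."""
    return replace(first, w=second.w), replace(second, w=first.w)


# Toy latents for two sequences of two frames each (values are arbitrary).
seq_1 = SequenceLatents(
    w=[1.0, 1.0],
    z_av=[[0.1], [0.2]], z_a=[[0.3], [0.4]], z_v=[[0.5], [0.6]],
)
seq_2 = SequenceLatents(
    w=[2.0, 2.0],
    z_av=[[0.7], [0.8]], z_a=[[0.9], [1.0]], z_v=[[1.1], [1.2]],
)

swapped_1, swapped_2 = swap_static(seq_1, seq_2)
```

In a full pipeline, `swapped_1` and `swapped_2` would then be passed through the decoders to synthesize the transformed sequences; that step is outside the scope of this sketch.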
