A vector quantized masked autoencoder for audiovisual speech emotion recognition

Samir Sadok, Simon Leglaive, Renaud Séguier

TL;DR

This work introduces VQ-MAE-AV, a self-supervised multimodal masked autoencoder for audiovisual speech emotion recognition (SER). Two frozen VQ-VAEs convert the audio and visual streams into discrete tokens, which are masked with a coupled strategy and passed to an attention-based encoder–decoder that uses global tokens to fuse the modalities. The model is pre-trained on unlabeled audiovisual data with two losses, a generative token-reconstruction loss $\mathcal{L}_{gen}$ and a contrastive loss $\mathcal{L}_{NCE}$ between the audio and visual global tokens, and is then fine-tuned on emotion-labeled datasets using three fusion-head strategies, notably Query2Emo. The reported results are state of the art on both controlled and in-the-wild datasets, and the ablations indicate that cross-attention fusion, the joint loss, and multimodal integration each contribute to robust emotion recognition.
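
To make the two pre-training objectives concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that the generative term is a cross-entropy over masked token positions only, that the contrastive term is a symmetric InfoNCE over cosine similarities within a batch, and that the two are combined with a scalar weight. The function names, the `temperature` value, and `nce_weight` are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def generative_loss(logits, targets, masked):
    """Mean cross-entropy over masked positions (assumed form of L_gen).

    logits:  one list of codebook scores per token position
    targets: the true token index per position
    masked:  True where the token was hidden from the encoder
    """
    losses = [-log_softmax(l)[t] for l, t, m in zip(logits, targets, masked) if m]
    if not losses:
        raise ValueError("at least one position must be masked")
    return sum(losses) / len(losses)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(audio_globals, visual_globals, temperature=0.1):
    """Symmetric InfoNCE (assumed form of L_NCE).

    The audio and visual global tokens of the same sequence form the
    positive pair; every other pairing in the batch is a negative.
    """
    n = len(audio_globals)
    sim = [[cosine(a, v) / temperature for v in visual_globals]
           for a in audio_globals]
    audio_to_visual = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    visual_to_audio = -sum(
        log_softmax([sim[j][i] for j in range(n)])[i] for i in range(n)
    ) / n
    return 0.5 * (audio_to_visual + visual_to_audio)

def pretraining_loss(logits, targets, masked, audio_globals, visual_globals,
                     nce_weight=1.0):
    """Weighted sum of both terms; nce_weight is a hypothetical knob."""
    return (generative_loss(logits, targets, masked)
            + nce_weight * info_nce(audio_globals, visual_globals))
```

With uniform logits over a codebook of size K, `generative_loss` reduces to log K, which is a convenient sanity check. `info_nce` is small when each audio global token is most similar to the visual global token of the same sequence and grows when the pairs are mismatched.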

Abstract

An important challenge in emotion recognition is to develop methods that can leverage unlabeled training data. In this paper, we propose the VQ-MAE-AV model, a self-supervised multimodal model that leverages masked autoencoders to learn representations of audiovisual speech without labels. The model includes vector quantized variational autoencoders that compress raw audio and visual speech data into discrete tokens. The audiovisual speech tokens are used to train a multimodal masked autoencoder that consists of an encoder-decoder architecture with attention mechanisms. The model is designed to extract both local (i.e., at the frame level) and global (i.e., at the sequence level) representations of audiovisual speech. During self-supervised pre-training, the VQ-MAE-AV model is trained on a large-scale unlabeled dataset of audiovisual speech, for the task of reconstructing randomly masked audiovisual speech tokens and with a contrastive learning strategy. During this pre-training, the encoder learns to extract a representation of audiovisual speech that can be subsequently leveraged for emotion recognition. During the supervised fine-tuning stage, a small classification model is trained on top of the VQ-MAE-AV encoder for an emotion recognition task. The proposed approach achieves state-of-the-art emotion recognition results across several datasets in both controlled and in-the-wild conditions.
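
The masking step of the pre-training task can be illustrated with a short sketch. It is an assumption-laden illustration rather than the paper's procedure: each modality is masked independently and uniformly at random, the mask ratio of 0.5 is arbitrary, and `MASK_ID` and `mask_tokens` are hypothetical names. How masking is coordinated across the two modalities is not described in this summary, so the sketch does not attempt it.

```python
import random

MASK_ID = -1  # hypothetical placeholder; real codebook indices are non-negative

def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    Returns the corrupted sequence and a boolean list marking the
    positions the model would be asked to reconstruct.
    """
    n_masked = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK_ID if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    is_masked = [i in masked_positions for i in range(len(tokens))]
    return corrupted, is_masked

# Toy token sequences standing in for the VQ-VAE outputs.
rng = random.Random(0)
audio_tokens = [12, 7, 7, 301, 45, 9, 88, 3]
visual_tokens = [5, 5, 19, 240, 2, 61, 61, 17]

audio_in, audio_mask = mask_tokens(audio_tokens, 0.5, rng)
visual_in, visual_mask = mask_tokens(visual_tokens, 0.5, rng)
```

The boolean masks returned here are what a reconstruction loss would use to select the positions that count toward training.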

Paper Structure

This paper contains 47 sections, 4 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Discrete audio and visual tokens creation: (i) fully-convolutional VQ-VAEs are trained independently on the audio and visual modalities (see Section \ref{subsec:VQVAE}); (ii) discrete audio and visual tokens are built from the quantized representations provided by the frozen VQ-VAE encoders (see Section \ref{subsec:discrete_tokens}).
  • Figure 1: Audiovisual emotion recognition results in terms of accuracy (%) and F1 score (%) on the RAVDESS and CREMA-D datasets
  • Figure 2: VQ-MAE-AV model structure. See the first paragraph of Section \ref{sec:VQ-MAE-AV} for a complete description of the pipeline.
  • Figure 2: Accuracy (%) and F1 score (%) results on DFEW and Aff-Wild2. The best scores are in bold, and the second-best scores are underlined.
  • Figure 3: Overview of the three emotion recognition models trained on top of the VQ-MAE-AV encoder.
  • ...and 3 more figures