Fusion approaches for emotion recognition from speech using acoustic and text-based features

Leonardo Pepino, Pablo Riera, Luciana Ferrer, Agustin Gravano

TL;DR

Fusing acoustic and text-based systems is beneficial on both IEMOCAP and MSP-PODCAST, though only subtle differences are observed across the evaluated fusion approaches.

Abstract

In this paper, we study different approaches for classifying emotions from speech using acoustic and text-based features. We propose to obtain contextualized word embeddings with BERT to represent the information contained in speech transcriptions and show that this results in better performance than using GloVe embeddings. We also propose and compare different strategies to combine the audio and text modalities, evaluating them on the IEMOCAP and MSP-PODCAST datasets. We find that fusing acoustic and text-based systems is beneficial on both datasets, though only subtle differences are observed across the evaluated fusion approaches. Finally, for IEMOCAP, we show the large effect that the criteria used to define the cross-validation folds have on results. In particular, the standard way of creating folds for this dataset results in a highly optimistic estimation of performance for the text-based system, suggesting that some previous works may overestimate the advantage of incorporating transcriptions.
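
The effect of the fold criteria described in the abstract can be illustrated with a small sketch. It is not the authors' code: the toy utterance list, the field names `speaker` and `script`, and the helpers `group_folds` and `split` are all hypothetical, and dropping training utterances that share a script with the test fold is only one assumed way to obtain folds that are disjoint by speaker and by script.

```python
# Hypothetical toy corpus: several speakers act out the same scripts,
# so by-speaker folds alone still let test transcripts appear in training.
utterances = [
    {"id": "u1", "speaker": "A", "script": "s1"},
    {"id": "u2", "speaker": "A", "script": "s2"},
    {"id": "u3", "speaker": "B", "script": "s1"},
    {"id": "u4", "speaker": "B", "script": "s3"},
    {"id": "u5", "speaker": "C", "script": "s2"},
    {"id": "u6", "speaker": "C", "script": "s3"},
    {"id": "u7", "speaker": "D", "script": "s1"},
    {"id": "u8", "speaker": "D", "script": "s3"},
]


def group_folds(items, key, n_folds):
    """Give every item a fold index so that items sharing `key` share a fold."""
    groups = sorted({item[key] for item in items})
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[item[key]] for item in items]


def split(items, folds, test_fold, exclude_keys=()):
    """Return (train, test); train drops items that share any `exclude_keys`
    value with the test set (assumed way to enforce script disjointness)."""
    test = [it for it, f in zip(items, folds) if f == test_fold]
    held_out = {k: {it[k] for it in test} for k in exclude_keys}
    train = [
        it
        for it, f in zip(items, folds)
        if f != test_fold and all(it[k] not in held_out[k] for k in exclude_keys)
    ]
    return train, test


folds = group_folds(utterances, key="speaker", n_folds=4)

# By-speaker folds (SP): speakers are disjoint, scripts may still overlap.
sp_train, sp_test = split(utterances, folds, test_fold=0)

# By-speaker and by-script folds (SP&SC): scripts are disjoint as well.
spsc_train, spsc_test = split(utterances, folds, test_fold=0, exclude_keys=("script",))

leaked = {u["script"] for u in sp_train} & {u["script"] for u in sp_test}
print("scripts shared between train and test under SP:", sorted(leaked))
print("training ids under SP&SC:", [u["id"] for u in spsc_train])
```

In this toy example the by-speaker split keeps six training utterances, five of which reuse a test script, while the stricter split keeps only the three whose script never occurs in the test fold. This is the kind of transcript overlap that can make a text-based system look better than it is.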

Paper Structure (14 sections, 3 figures, 1 table)

Figures (3)

  • Figure 1: Text-based and audio-based architectures. $T_{text}$ and $T_{audio}$ are the sequence lengths of the model inputs and $D_{text}$ and $D_{audio}$ are the number of features for each input. $N_F$ is the number of convolutional filters, $S$ is the kernel size and $N_U$ is the number of neurons in dense layers. 1D-Convolutional layers operate on the time axis.
  • Figure 2: Effect of different criteria for defining the folds in IEMOCAP on audio- and text-based systems for two different model sizes (small and large). RAND: random folds, SP: by-speaker folds, SP&SC: by-speaker and by-script folds.
  • Figure 3: Results for the IEMOCAP and MSP-PODCAST datasets. Average AUC distributions over 10 different initialization seeds for different systems: the audio model, GloVe- and BERT-based text models, early fusion with cold-start (EF-CS), pretraining (EF-PT), and warm-start (EF-WS), and late fusion with pretraining (LF-PT).
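
To make the shapes in the Figure 1 caption concrete, the following minimal sketch applies a 1D convolution along the time axis to an input of $T$ frames with $D$ features each, using $N_F$ filters of kernel size $S$. It is an illustration under assumptions the caption does not state: stride 1, no padding, no bias or activation. The function name `conv1d_time` and the toy values are hypothetical.

```python
def conv1d_time(x, kernels):
    """1D convolution (cross-correlation, as in most deep learning libraries)
    along the time axis, assuming stride 1 and no padding.

    x:       T frames, each a list of D features
    kernels: N_F filters, each a list of S taps, each tap a list of D weights
    returns: T - S + 1 frames, each a list of N_F values
    """
    T, S = len(x), len(kernels[0])
    out = []
    for t in range(T - S + 1):          # slide the window over time only
        frame = []
        for kernel in kernels:          # one output channel per filter
            acc = 0.0
            for s in range(S):          # positions inside the window
                for d, value in enumerate(x[t + s]):  # all input features
                    acc += kernel[s][d] * value
            frame.append(acc)
        out.append(frame)
    return out


# Toy input with T = 5 frames and D = 2 features per frame.
x = [[float(t), 1.0] for t in range(5)]

# N_F = 3 filters with kernel size S = 2.
kernels = [
    [[1.0, 0.0], [1.0, 0.0]],    # sums feature 0 over the window
    [[0.0, 1.0], [0.0, 1.0]],    # sums feature 1 over the window
    [[-1.0, 0.0], [1.0, 0.0]],   # frame-to-frame difference of feature 0
]

y = conv1d_time(x, kernels)
print(len(y), len(y[0]))  # 4 3, i.e. (T - S + 1, N_F)
```

Each filter spans all $D$ input features but only $S$ consecutive frames, so the feature dimension is replaced by $N_F$ channels while the time axis shrinks to $T - S + 1$ under these assumptions.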