Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network

Shaokai Li, Yixuan Ji, Peng Song, Haoqin Sun, Wenming Zheng

TL;DR

This paper proposes a deep inductive transfer learning framework, the feature distribution adaptation network (FDAN), for multi-modal speech emotion recognition. It uses deep transfer learning strategies to align the visual and audio feature distributions so that both modalities yield a consistent representation of emotion.

Abstract

In this paper, we propose a novel deep inductive transfer learning framework, named feature distribution adaptation network, to tackle the challenging multi-modal speech emotion recognition problem. Our method uses deep transfer learning strategies to align visual and audio feature distributions and obtain a consistent representation of emotion, thereby improving the performance of speech emotion recognition. In our model, a pre-trained ResNet-34 is used to extract features from facial expression images and acoustic Mel spectrograms, respectively. Then, a cross-attention mechanism is introduced to model the intrinsic similarity relationships between the multi-modal features. Finally, the multi-modal feature distribution adaptation is performed efficiently with a feed-forward network, which is extended with the local maximum mean discrepancy (LMMD) loss. Experiments are carried out on two benchmark datasets, and the results demonstrate that our model achieves excellent performance compared with existing methods.
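To make the cross-attention step more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, in which the feature rows of one modality act as queries and the rows of the other modality act as keys and values. It is an illustration only: the function names `softmax` and `cross_attention` are hypothetical, learned projection matrices are omitted, and the abstract does not specify the module's internals, so the actual design in the paper may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention across two modalities.

    queries:      list of feature vectors from one modality (e.g. visual)
    keys_values:  list of feature vectors from the other modality (e.g. acoustic)
    Both must share the same feature dimension. Learned query/key/value
    projections are left out (an assumption made to keep the sketch short).
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        # Similarity of this query to every row of the other modality.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted average of the other modality's rows.
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


# Toy features standing in for X_v (visual) and X_a (acoustic).
x_v = [[1.0, 0.0], [0.0, 1.0]]
x_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_attends_a = cross_attention(x_v, x_a)  # visual queries over acoustic rows
a_attends_v = cross_attention(x_a, x_v)  # acoustic queries over visual rows
```

Each output row is a convex combination of the other modality's rows, so rows that are similar across modalities contribute most to one another.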

Paper Structure

This paper contains 14 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The framework of FDAN. The blue and yellow parts represent the learning process of visual and acoustic features, respectively. The purple part represents the cross-attention mechanism. $X_v$ and $X_a$ represent the visual and acoustic features extracted by the pre-trained ResNet-34, respectively. By minimizing the LMMD loss, the feature distribution discrepancy between the coupled feature subspaces $Z_v^i$ and $Z_a^i$ in the $i$-th layer is reduced, where $i\in l$. $Y_v$ and $Y_a$ represent the true labels for the visual and acoustic samples, and $\hat{Y}_v$ and $\hat{Y}_a$ represent the corresponding predicted labels.
  • Figure 2: The structure of the cross-attention module.
  • Figure 3: The framework of the training and testing process of the FDAN model.
  • Figure 4: Confusion matrices of our model. The horizontal axis represents the predicted label, and the vertical axis represents the true label (AN: anger, DI: disgust, FE: fear, HA: happiness, NE: neutral, SA: sadness, and SU: surprise).
  • Figure 5: The t-SNE data visualization results. The $\textbf{+}$ and $\circ$ represent the visual and acoustic samples, respectively, and different colors represent different emotion categories (AN: anger, DI: disgust, FE: fear, HA: happiness, NE: neutral, SA: sadness, and SU: surprise).
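The LMMD loss mentioned in the Figure 1 caption can be illustrated with a small sketch. It follows the commonly used class-wise formulation of local maximum mean discrepancy: for each emotion class, the class-weighted kernel mean embeddings of the two modalities are compared, and the squared distances are averaged over classes. Several things here are assumptions and not taken from the paper: the function names, the single Gaussian kernel with bandwidth `sigma`, the use of hard true labels for both modalities, and applying the loss to one layer's features at a time.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def class_weights(labels, c):
    """Weight 1/n_c for samples of class c, 0 otherwise (all 0 if class absent)."""
    n_c = sum(1 for y in labels if y == c)
    if n_c == 0:
        return [0.0] * len(labels)
    return [1.0 / n_c if y == c else 0.0 for y in labels]


def weighted_kernel_sum(xs, ws, ys, vs, sigma):
    """Sum of w_i * v_j * k(x_i, y_j) over all pairs with non-zero weights."""
    return sum(
        w * v * gaussian_kernel(x, y, sigma)
        for x, w in zip(xs, ws) if w
        for y, v in zip(ys, vs) if v
    )


def lmmd(z_v, y_v, z_a, y_a, num_classes, sigma=1.0):
    """Class-wise MMD between visual features z_v and acoustic features z_a."""
    total = 0.0
    for c in range(num_classes):
        w_v = class_weights(y_v, c)
        w_a = class_weights(y_a, c)
        if not any(w_v) or not any(w_a):
            continue  # class missing in one modality: skipped in this sketch
        total += (
            weighted_kernel_sum(z_v, w_v, z_v, w_v, sigma)
            + weighted_kernel_sum(z_a, w_a, z_a, w_a, sigma)
            - 2.0 * weighted_kernel_sum(z_v, w_v, z_a, w_a, sigma)
        )
    return total / num_classes


# Toy layer features for two emotion classes.
z_v = [[0.0, 0.0], [1.0, 1.0]]
z_a = [[5.0, 5.0], [6.0, 6.0]]
labels = [0, 1]
aligned = lmmd(z_v, labels, z_v, labels, num_classes=2)
misaligned = lmmd(z_v, labels, z_a, labels, num_classes=2)
```

In this toy case `aligned` is zero because the two feature sets coincide class by class, while `misaligned` is clearly positive. Minimizing such a term pulls same-class visual and acoustic features toward each other, which matches the alignment goal described in the caption.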