Multimodal Emotion Recognition from Raw Audio with Sinc-convolution

Xiaohui Zhang, Wenjie Fu, Mangui Liang

TL;DR

A Sinc-convolution layer, an efficient architecture for preprocessing raw speech waveforms, extracts acoustic features from raw audio for emotion recognition. It is followed by a long short-term memory (LSTM) network, combined with linguistic features, and complemented by a dialogical emotion decoding (DED) strategy.

Abstract

Speech Emotion Recognition (SER) remains a difficult task for computers, with average recall rates typically around 70% on the most realistic datasets. Most SER systems use hand-crafted features extracted from the audio signal, such as energy, zero-crossing rate, spectral information, prosodic features, and mel-frequency cepstral coefficients (MFCCs). More recently, training neural networks directly on raw waveforms has become an emerging trend. This approach is advantageous because it eliminates the feature extraction pipeline. Learning from the time-domain signal has shown good results for tasks such as speech recognition and speaker verification. In this paper, we use a Sinc-convolution layer, an efficient architecture for preprocessing raw speech waveforms, to extract acoustic features from raw audio signals for emotion recognition, followed by a long short-term memory (LSTM) network. We also incorporate linguistic features and append a dialogical emotion decoding (DED) strategy. Our approach achieves a weighted accuracy of 85.1% for four-class emotion recognition on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
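
As a rough illustration of the idea behind a Sinc-convolution front end, the sketch below builds band-pass kernels from just two cutoff frequencies each and applies them to a raw waveform. It assumes the common SincNet-style parameterization (a band-pass filter written as the difference of two windowed low-pass sinc filters), which may differ from the authors' implementation. The function names `sinc_bandpass` and `sinc_conv`, the kernel size of 251 samples, the 16 kHz sample rate, and the Hamming window are all illustrative assumptions. In a trainable layer the cutoffs would be learned parameters; here they are fixed so the example stays self-contained in NumPy.

```python
import numpy as np


def sinc_bandpass(f_low_hz, f_high_hz, kernel_size=251, sample_rate=16000):
    """Windowed band-pass FIR kernel defined only by its two cutoffs (hypothetical helper)."""
    # Symmetric time axis in samples, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # Cutoffs normalised to cycles per sample.
    f1 = f_low_hz / sample_rate
    f2 = f_high_hz / sample_rate
    # Band-pass = high-cutoff low-pass minus low-cutoff low-pass.
    # Note: np.sinc(x) is the normalised sinc, sin(pi * x) / (pi * x).
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # A Hamming window reduces ripple caused by truncating the ideal filter.
    return kernel * np.hamming(kernel_size)


def sinc_conv(waveform, bands, kernel_size=251, sample_rate=16000):
    """Filter a 1-D waveform with one sinc kernel per (low, high) band in Hz."""
    kernels = [sinc_bandpass(lo, hi, kernel_size, sample_rate) for lo, hi in bands]
    # One output channel per band; "valid" keeps only fully overlapped samples.
    return np.stack([np.convolve(waveform, k, mode="valid") for k in kernels])


# Usage: a 1 kHz tone should mostly survive the band that contains it.
t = np.arange(16000) / 16000
tone = np.sin(2 * np.pi * 1000 * t)
features = sinc_conv(tone, bands=[(800, 1200), (3000, 4000)])
band_energy = (features ** 2).mean(axis=1)
print(features.shape, band_energy)
```

Each filter here is described by two numbers instead of one weight per tap, which is the property that makes this kind of layer compact compared with a standard convolution over raw audio. The resulting per-band channels are the sort of feature sequence that a recurrent model such as an LSTM could then consume.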

Paper Structure

This paper contains 10 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Architecture of Sinc-conv layer.
  • Figure 2: Sentence Error Rate of CNN, Sinc-DNN and Sinc-LSTM over various training epochs.
  • Figure 3: Architecture of the fusion model combining Sinc-LSTM and LSTM with DED post-processing.
  • Figure 4: Performance of DED with different pre-classifiers.