emoDARTS: Joint Optimisation of CNN & Sequential Neural Network Architectures for Superior Speech Emotion Recognition

Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Berrak Sisman, Bjorn W. Schuller, Carlos Busso

TL;DR

emoDARTS is a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture for SER. It outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM.

Abstract

Speech Emotion Recognition (SER) is crucial for enabling computers to understand the emotions conveyed in human communication. With recent advancements in Deep Learning (DL), the performance of SER models has significantly improved. However, designing an optimal DL architecture requires specialised knowledge and experimental assessments. Fortunately, Neural Architecture Search (NAS) provides a potential solution for automatically determining the best DL model. Differentiable Architecture Search (DARTS) is a particularly efficient method for discovering optimal models. This study presents emoDARTS, a DARTS-optimised joint CNN and Sequential Neural Network (SeqNN: LSTM, RNN) architecture that enhances SER performance. The literature supports coupling CNN and LSTM to improve performance. While DARTS has previously been used to choose CNN and LSTM operations independently, our technique adds a novel mechanism for selecting CNN and SeqNN operations jointly using DARTS. Unlike earlier work, we do not impose limits on the layer order of the CNN. Instead, we let DARTS choose the best layer order inside the DARTS cell. Evaluating our approach on the IEMOCAP, MSP-IMPROV, and MSP-Podcast datasets, we demonstrate that emoDARTS outperforms conventionally designed CNN-LSTM models and surpasses the best-reported SER results achieved through DARTS on CNN-LSTM.
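
As a rough illustration of how DARTS makes the choice of operation differentiable, the sketch below blends a few candidate operations with softmax weights and then keeps the strongest one. It is a simplified toy, not the emoDARTS implementation: the candidate operations, the names `CANDIDATE_OPS`, `mixed_op` and `discretise`, and the scalar input are all assumptions made for illustration, and a real search would learn the architecture weights by gradient descent rather than set them by hand.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy candidates acting on a scalar; the paper's search space
# contains CNN and SeqNN operations instead.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """Continuous relaxation: softmax-weighted sum of all candidate outputs."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretise(alphas):
    """After the search, keep only the candidate with the largest weight."""
    names = list(CANDIDATE_OPS)
    best = max(range(len(alphas)), key=lambda i: alphas[i])
    return names[best]


# Equal weights blend all candidates evenly; hand-set weights stand in for
# values that a real search would have learned.
blended = mixed_op(3.0, [0.0, 0.0, 0.0])
chosen = discretise([0.1, 1.5, -0.3])
print(blended, chosen)
```

Because the blend is a smooth function of the architecture weights, those weights can be trained alongside the network weights, and the final architecture is read off by taking the strongest candidate on each edge.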

Paper Structure (18 sections, 4 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: The proposed emoDARTS architecture passes the input features through the CNN component, then the SeqNN component, and finally a dense layer. DARTS jointly selects the optimum CNN and SeqNN operations.
  • Figure 2: The emoDARTS architecture comprises input features processed through CNN, SeqNN, and Dense layers and it utilises DARTS for jointly optimising the CNN and SeqNN components.
  • Figure 3: DARTS employs steps (a) to (d) to search cell architectures: (a) initialises the graph, (b) forms a search space, (c) updates edge weights, and (d) determines the final cell structure. Nodes signify representations, edges represent operations, with light-coloured edges indicating weaker and dark-coloured edges representing stronger operations.
  • Figure 4: Visualisation of the CNN+LSTM attention baseline model. The CNN layer uses kernel size (k)=2, stride (s)=2, and padding (p)=2; the max-pooling layer uses kernel size (k)=2 and stride (s)=2; and the LSTM layer has 128 units.
  • Figure 5: Comparison of UA% across the datasets between the NAS-generated models (emoDARTS) and the CNN+LSTM attention models developed without DARTS (w/o DARTS).
  • ...and 5 more figures
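
To make the layer parameters in the Figure 4 caption concrete, the following sketch applies the standard output-length formula for strided convolution and pooling, floor((n + 2p - k) / s) + 1, to the stated values. The input length of 100 frames is a hypothetical value chosen for illustration, and the sketch assumes a dilation of 1 and no padding on the max-pooling layer, neither of which the caption states.

```python
def conv_output_length(n: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length along one axis for a conv or pooling layer (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


frames = 100  # hypothetical input length, not taken from the paper

# CNN layer from the Figure 4 caption: k=2, s=2, p=2
after_conv = conv_output_length(frames, kernel=2, stride=2, padding=2)

# Max-pooling layer from the caption: k=2, s=2 (padding assumed to be 0)
after_pool = conv_output_length(after_conv, kernel=2, stride=2)

print(after_conv, after_pool)  # prints "52 26"
```

Under these assumptions, each such CNN and max-pooling pair shortens the sequence to roughly a quarter of its length before it reaches the LSTM layer.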