TBDM-Net: Bidirectional Dense Networks with Gender Information for Speech Emotion Recognition

Vlad Striletchi; Cosmin Striletchi; Adriana Stan

TL;DR

TBDM-Net is evaluated comprehensively, including an ablation study, on six widely used datasets for unimodal speech emotion recognition (SER).

Abstract

This paper presents a novel deep neural network architecture tailored for Speech Emotion Recognition (SER). The architecture relies on dense interconnections among multiple layers of bidirectional dilated convolutions. A linear kernel dynamically fuses the outputs of these layers to yield the final emotion class prediction. We call this architecture TBDM-Net: Temporally-Aware Bi-directional Dense Multi-Scale Network. We conduct a comprehensive performance evaluation of TBDM-Net, including an ablation study, across six widely used datasets for unimodal speech emotion recognition. Additionally, we explore the influence of gender-informed emotion prediction by appending either gold-standard or predicted gender labels to the architecture's inputs or predictions. The implementation of TBDM-Net is available at: https://github.com/adrianastan/tbdm-net

Paper Structure (9 sections, 1 figure, 6 tables)


Introduction
TBDM-Net Architecture
Evaluation
Speech datasets and features
Objective measures and training procedure
Baseline results
Ablation study
Gender-informed results
Conclusions

Figures (1)

Figure 1: The TBDM-Net architecture. The forward- and reverse-time speech representations are passed through a series of Temporally-Aware Blocks (TABs). The intermediate bidirectional representations are concatenated ($g_k$), passed through a dimension-reduction convolutional block, and averaged to obtain a final concatenation of different time-scale representations ($g_k'$). Each TAB's input is passed forward to the next modules through dense concatenative connections. The final representation dynamically fuses the multi-scale representations ($g_{df}$) and is passed through a fully connected layer ($FC$) to output emotion class probabilities.
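
The data flow in the caption can be illustrated with a toy, pure-Python sketch. It is not the authors' implementation (see the linked repository for that), and every concrete choice in it is an assumption made for illustration: the function names, the single shared two-tap kernel, the dilation schedule $2^k$, the use of a plain average as a stand-in for the dimension-reduction convolution, scalar per-scale features, and the fixed fusion and $FC$ weights, which would be learned in a real model.

```python
import math


def causal_dilated_conv(channels, kernel, dilation):
    """Causal dilated 1-D convolution summed over all input channels.

    Positions before the start of the sequence are treated as zeros.
    A ReLU is applied to the output. The same kernel is reused for every
    channel, which is a simplification for this sketch.
    """
    n_frames = len(channels[0])
    out = [0.0] * n_frames
    for ch in channels:
        for t in range(n_frames):
            for j, w in enumerate(kernel):
                idx = t - j * dilation
                if idx >= 0:
                    out[t] += w * ch[idx]
    return [max(0.0, v) for v in out]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tbdm_sketch(x, n_blocks=3, kernel=(0.5, 0.5), fusion_w=None, fc=None):
    """Toy forward pass following the Figure 1 data flow (assumed details)."""
    if fusion_w is None:
        # Hypothetical fusion weights; learned in a real model.
        fusion_w = [1.0 / n_blocks] * n_blocks
    if fc is None:
        # Hypothetical FC layer: one (weight, bias) pair per emotion class.
        fc = [(1.0, 0.0), (-1.0, 0.0), (0.5, 0.1), (0.0, 0.0)]

    fwd = [list(x)]            # forward-time representation
    bwd = [list(reversed(x))]  # reverse-time representation
    g_prime = []
    for k in range(n_blocks):
        dilation = 2 ** k  # assumed dilation schedule
        f_out = causal_dilated_conv(fwd, kernel, dilation)
        b_out = causal_dilated_conv(bwd, kernel, dilation)
        g_k = [f_out, b_out]  # concatenated bidirectional representation
        # Stand-in for the dimension-reduction block: average the channels.
        reduced = [(f + b) / 2.0 for f, b in zip(*g_k)]
        # Average over time to get one feature for this time scale.
        g_prime.append(sum(reduced) / len(reduced))
        # Dense connections: the next block sees this block's input
        # concatenated with its output.
        fwd = fwd + [f_out]
        bwd = bwd + [b_out]

    # Dynamic fusion: weighted sum of the multi-scale features.
    g_df = sum(w * g for w, g in zip(fusion_w, g_prime))
    probs = softmax([a * g_df + b for a, b in fc])
    return g_prime, g_df, probs


g_prime, g_df, probs = tbdm_sketch([0.1, 0.4, 0.3, 0.9, 0.2, 0.5])
print(g_prime, g_df, probs)
```

The sketch returns one feature per time scale, the fused value, and a probability per class. Because each scale is averaged over time before fusion, the time alignment of the forward and reversed outputs does not affect the result here; a model that keeps the time axis would need to re-reverse the backward branch first.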
