Enhancing Speech Emotion Recognition through Segmental Average Pooling of Self-Supervised Learning Features

Jonghwan Hyeon, Yung-Hwan Oh, Ho-Jin Choi

TL;DR

Segmental Average Pooling (SAP) is a novel pooling technique that selectively emphasizes informative verbal segments while disregarding non-verbal parts. It leads to significant improvements in SER performance in terms of both unweighted and weighted accuracy.

Abstract

Speech Emotion Recognition (SER) analyzes human emotions expressed through speech. Self-supervised learning (SSL) offers a promising approach to SER by learning meaningful representations from a large amount of unlabeled audio data. However, existing SSL-based methods rely on Global Average Pooling (GAP) to represent audio signals, treating speech and non-speech segments equally. As a result, informative speech features can be diluted by irrelevant non-speech information. To address this, we propose Segmental Average Pooling (SAP), which selectively focuses on informative speech segments while ignoring non-speech segments. By applying both GAP and SAP to SSL features, our approach uses overall speech signal information from GAP and segment-specific information from SAP, leading to improved SER performance. Experiments show state-of-the-art results on the English IEMOCAP dataset and superior performance on the Korean KEMDy19 dataset, in both unweighted and weighted accuracy.
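The difference between the two pooling schemes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how speech segments are located or how the GAP and SAP vectors are combined. The sketch assumes a per-frame boolean `speech_mask` (hypothetical, e.g. from a voice activity detector or an alignment), treats SAP as an average over the speech frames only, and combines the two vectors by concatenation. All function names are illustrative.

```python
from typing import List, Sequence


def global_average_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """GAP: mean over every frame, speech and non-speech alike."""
    if not frames:
        raise ValueError("frames must not be empty")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def segmental_average_pool(
    frames: Sequence[Sequence[float]], speech_mask: Sequence[bool]
) -> List[float]:
    """Assumed reading of SAP: mean over frames marked as speech only."""
    if len(frames) != len(speech_mask):
        raise ValueError("frames and speech_mask must have the same length")
    speech = [f for f, keep in zip(frames, speech_mask) if keep]
    # Assumption: fall back to GAP when no frame is marked as speech.
    return global_average_pool(speech if speech else frames)


def pooled_representation(
    frames: Sequence[Sequence[float]], speech_mask: Sequence[bool]
) -> List[float]:
    """Assumption: combine GAP and SAP by simple concatenation."""
    return global_average_pool(frames) + segmental_average_pool(frames, speech_mask)


# Toy SSL features: 4 frames of dimension 2; frames 0 and 3 are silence.
frames = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.0, 0.0]]
speech_mask = [False, True, True, False]

gap = global_average_pool(frames)
sap = segmental_average_pool(frames, speech_mask)
combined = pooled_representation(frames, speech_mask)
```

In this toy input the silent frames pull the GAP vector toward zero, while the masked average keeps the magnitude of the speech frames, which is the dilution effect the abstract describes.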

Paper Structure

This paper contains 16 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An overall architecture of our proposed approach
  • Figure 2: Confusion matrix on IEMOCAP and KEMDy19