Revealing Emotional Clusters in Speaker Embeddings: A Contrastive Learning Strategy for Speech Emotion Recognition

Ismail Rasim Ulgen, Zongyang Du, Carlos Busso, Berrak Sisman

TL;DR

The paper tackles the problem of limited labeled data in speech emotion recognition (SER) by showing that emotion information is directly encoded in state-of-the-art speaker embeddings in the form of intra-speaker clusters. It introduces an utterance-level contrastive pretraining method that leverages emotion-unlabeled data by forming positive and negative pairs from the intra-speaker clusters, and it explores a multi-task extension with speaker classification (and adversarial variants). Empirical results on IEMOCAP, CREMA-D, ESD, and RAVDESS show that contrastive pretraining, especially in a multi-task setup, improves SER performance, and that leveraging unlabeled data with this strategy can outperform strong baselines, including wav2vec 2.0 pretraining. The findings offer a data-efficient approach to SER and deepen the understanding of the link between emotion and speaker representations, with implications for robust emotion-aware systems in low-label regimes.

Abstract

Speaker embeddings carry valuable emotion-related information, which makes them a promising resource for enhancing speech emotion recognition (SER), especially with limited labeled data. Traditionally, it has been assumed that emotion information is indirectly embedded within speaker embeddings, leading to their under-utilization. Our study reveals a direct and useful link between emotion and state-of-the-art speaker embeddings in the form of intra-speaker clusters. By conducting a thorough clustering analysis, we demonstrate that emotion information can be readily extracted from speaker embeddings. To leverage this information, we introduce a novel contrastive pretraining approach applied to emotion-unlabeled data for speech emotion recognition. The proposed approach samples positive and negative examples based on the intra-speaker clusters of speaker embeddings. The proposed strategy, which leverages extensive emotion-unlabeled data, leads to a significant improvement in SER performance, whether employed as a standalone pretraining task or integrated into a multi-task pretraining setting.
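
The pair-sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which clustering algorithm, how many clusters, or exactly where negatives are drawn from. The sketch assumes plain k-means run separately per speaker, a hypothetical cluster count `k`, and negatives taken from the same speaker's other clusters. The function names `kmeans`, `intra_speaker_clusters`, and `sample_pair` are illustrative only.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns a cluster index per row of x."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies the rows).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def intra_speaker_clusters(embeddings, speaker_ids, k):
    """Cluster each speaker's embeddings separately; no emotion labels are used."""
    clusters = np.empty(len(embeddings), dtype=int)
    for spk in np.unique(speaker_ids):
        idx = np.flatnonzero(speaker_ids == spk)
        clusters[idx] = kmeans(embeddings[idx], k)
    return clusters


def sample_pair(anchor, speaker_ids, clusters, rng):
    """Return (positive, negative) utterance indices for an anchor utterance.

    Assumption: the positive shares the anchor's speaker and cluster, and the
    negative comes from a different cluster of the same speaker.
    """
    same_speaker = speaker_ids == speaker_ids[anchor]
    same_cluster = clusters == clusters[anchor]
    pos_pool = np.flatnonzero(same_speaker & same_cluster)
    pos_pool = pos_pool[pos_pool != anchor]  # the anchor is not its own positive
    neg_pool = np.flatnonzero(same_speaker & ~same_cluster)
    if len(pos_pool) == 0 or len(neg_pool) == 0:
        raise ValueError("anchor has no valid positive or negative candidate")
    return int(rng.choice(pos_pool)), int(rng.choice(neg_pool))
```

In a pretraining loop, the triplet returned for each anchor would feed a contrastive loss over the encoder outputs of the corresponding utterances; the loss itself is left out here because its exact form is not given in this summary.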

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Visualization of intra-speaker clusters in two datasets. The colors represent {speaker id}_{emotion}.
  • Figure 2: a) Proposed contrastive pre-training and SER training, b) Proposed multi-task learning framework.
  • Figure 3: The encoder architecture utilized in the networks.