Improving Speaker-independent Speech Emotion Recognition Using Dynamic Joint Distribution Adaptation

Cheng Lu, Yuan Zong, Hailun Lian, Yan Zhao, Björn Schuller, Wenming Zheng

TL;DR

The paper tackles speaker-independent speech emotion recognition (SER) under multi-speaker distribution shifts that degrade performance on unseen speakers. It introduces Dynamic Joint Distribution Adaptation (DJDA), which integrates Marginal Distribution Alignment (MDA) and Class-wise Conditional Distribution Alignment (CDA) within a multi-source domain adaptation framework, guided by a dynamic balance factor based on the $\mathcal{A}$-distance. A dynamic weight $w=1-\frac{d_{md}}{d_{md}+\textstyle\sum_{m=1}^{c}d^{m}_{cd}}$ balances MDA and CDA, with $d_{md}=2(1-2\mathcal{L}_{m})$ and $d^{m}_{cd}=2(1-2(\mathcal{L}_{cst}^{m}+\mathcal{L}_{csp}^{m}))$, and total loss $\mathcal{L}_{total}=\mathcal{L}_{ce}-\eta\big((1-w)\mathcal{L}_{md}+w\mathcal{L}_{cd}\big)$. Experiments on IEMOCAP and Emo-DB show DJDA achieving state-of-the-art performance, with ablations confirming the contributions of both MDA and CDA and the effectiveness of the dynamic balancing strategy for handling unknown target distributions across new speakers.
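
The formulas in the TL;DR can be traced with a small numeric sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the loss values are treated as plain floats that have already been computed elsewhere, the summary does not say how $\mathcal{L}_{cd}$ is aggregated over classes (so it is passed in as a given scalar), and the zero-denominator check is an added safeguard rather than something the source describes.

```python
from typing import Sequence


def a_distance(loss: float) -> float:
    """Proxy A-distance as written in the summary: d = 2 * (1 - 2 * loss)."""
    return 2.0 * (1.0 - 2.0 * loss)


def dynamic_weight(
    loss_m: float,
    loss_cst: Sequence[float],
    loss_csp: Sequence[float],
) -> float:
    """Balance factor w = 1 - d_md / (d_md + sum over classes of d_cd^m).

    loss_m   : marginal-alignment loss (L_m in the summary)
    loss_cst : per-class losses L_cst^m, one entry per emotion class
    loss_csp : per-class losses L_csp^m, one entry per emotion class
    """
    if len(loss_cst) != len(loss_csp):
        raise ValueError("loss_cst and loss_csp need one entry per class")
    d_md = a_distance(loss_m)
    d_cd = [a_distance(st + sp) for st, sp in zip(loss_cst, loss_csp)]
    denom = d_md + sum(d_cd)
    if denom == 0.0:
        # Assumption: the summary does not cover this case, so fail loudly.
        raise ZeroDivisionError("d_md + sum(d_cd) is zero; w is undefined")
    return 1.0 - d_md / denom


def total_loss(
    loss_ce: float, loss_md: float, loss_cd: float, w: float, eta: float
) -> float:
    """L_total = L_ce - eta * ((1 - w) * L_md + w * L_cd)."""
    return loss_ce - eta * ((1.0 - w) * loss_md + w * loss_cd)


if __name__ == "__main__":
    # Two emotion classes with made-up loss values, purely for illustration.
    w = dynamic_weight(0.25, [0.125, 0.125], [0.125, 0.125])
    print("w =", w)
    print("L_total =", total_loss(1.0, 0.5, 0.25, w, eta=0.1))
```

With these formulas, a larger marginal distance $d_{md}$ relative to the summed class-wise distances gives a smaller $w$, which shifts the weighting toward the MDA term $(1-w)\mathcal{L}_{md}$; the reverse case shifts it toward the CDA term.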

Abstract

In speaker-independent speech emotion recognition, the training and testing samples are collected from diverse speakers, leading to a multi-domain shift challenge across the feature distributions of data from different speakers. Consequently, when the trained model is confronted with data from new speakers, its performance tends to degrade. To address this issue, we propose a Dynamic Joint Distribution Adaptation (DJDA) method under the framework of multi-source domain adaptation. DJDA first uses joint distribution adaptation (JDA), involving marginal distribution adaptation (MDA) and conditional distribution adaptation (CDA), to more precisely measure the multi-domain distribution shifts caused by different speakers. This helps eliminate speaker bias in emotion features, so that discriminative and speaker-invariant speech emotion features can be learned from the coarse level to the fine level. Furthermore, we quantify the adaptation contributions of MDA and CDA within JDA by using a dynamic balance factor based on the $\mathcal{A}$-distance, which helps the model handle the unknown distributions encountered in data from new speakers. Experimental results demonstrate the superior performance of our DJDA compared with other state-of-the-art (SOTA) methods.

Paper Structure

This paper contains 14 sections, 6 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the Dynamic Joint Distribution Adaptation (DJDA) method for speaker-independent SER, which primarily consists of JDA (encompassing MDA and CDA) and the dynamic JDA strategy.
  • Figure 2: Ablation experiments of DJDA on Emo-DB.
  • Figure 3: Dynamic balance factors for different target samples.