CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition

Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee

TL;DR

CO-VADA addresses speaker-related bias in speech emotion recognition without requiring demographic annotations or model modifications. It uses an early-stopped classifier to identify bias-guiding versus bias-contrary samples based on the losses $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{GCE}}$, and augments data by voice-converting bias-guiding samples to adopt underrepresented speaker traits while preserving emotion. The approach is model- and VC-tool-agnostic and demonstrates improved fairness (lower $\mathrm{TPR}_{\text{gap}}$ and $\mathrm{DP}_{\text{gap}}$) with competitive Macro-F1 across CREMA-D, MSP-Podcast, and MSP-IMPROV, validated through extensive ablations. This makes CO-VADA a scalable solution for fair SER in real-world settings where demographic metadata is unavailable or unreliable.
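
To make the two losses named above concrete, the sketch below computes per-sample $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{GCE}}$ from class probabilities, using the standard generalized cross-entropy form $(1 - p_y^q)/q$. The splitting rule is an assumption made for illustration: this summary does not state how the paper turns the losses into a bias-guiding versus bias-contrary decision, so the function `split_by_ce`, the threshold `tau`, and the default `q=0.7` are hypothetical.

```python
import math

def ce_loss(probs, label):
    """Standard cross-entropy for one sample: -log p_y."""
    return -math.log(probs[label])

def gce_loss(probs, label, q=0.7):
    """Generalized cross-entropy for one sample: (1 - p_y^q) / q.

    As q approaches 0 this tends to the cross-entropy; q=0.7 is an
    assumed default, not a value taken from the paper.
    """
    return (1.0 - probs[label] ** q) / q

def split_by_ce(all_probs, labels, tau=0.5):
    """Hypothetical split rule (an assumption, not the paper's criterion).

    Samples the early-stopped classifier already fits well (low CE) are
    treated as bias-guiding; the remaining ones as bias-contrary.
    """
    guiding, contrary = [], []
    for i, (probs, y) in enumerate(zip(all_probs, labels)):
        if ce_loss(probs, y) <= tau:
            guiding.append(i)
        else:
            contrary.append(i)
    return guiding, contrary

# Toy probabilities from an imagined early-stopped classifier.
all_probs = [[0.9, 0.1], [0.3, 0.7], [0.2, 0.8]]
labels = [0, 0, 1]
guiding, contrary = split_by_ce(all_probs, labels)
```

In this toy run, samples 0 and 2 fall in the bias-guiding set and sample 1 in the bias-contrary set. Under the approach described above, bias-guiding utterances would then be voice-converted toward the speaker traits of bias-contrary ones.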

Abstract

Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter emotion-irrelevant speaker attributes and generate augmented samples. These augmented samples introduce speaker variations that differ from dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.

Paper Structure

This paper contains 26 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Overview of the proposed CO-VADA. $\mathcal{L}_{\mathrm{CE}}$ denotes the standard cross-entropy loss, and $\mathcal{L}_{\mathrm{GCE}}$ denotes the generalized cross-entropy loss used during early-stopped training. Category refers to the emotion classes used as prediction targets, while Group represents the speaker subgroup. After voice conversion, the resulting utterance retains the emotional content of the bias-guiding sample while adopting the speaker identity of the bias-contrary sample.
  • Figure 2: Performance-fairness plots for CREMA-D across gender, race, and age. Points near the lower-right corner indicate the best trade-off between fairness and performance.
  • Figure 3: Performance-fairness plots for MSP-Podcast and MSP-IMPROV (gender only). Points closer to the lower-right corner reflect better overall trade-offs between performance and fairness.
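
The fairness axis in these plots is built on gap metrics such as $\mathrm{TPR}_{\text{gap}}$ and $\mathrm{DP}_{\text{gap}}$ from the TL;DR. As a rough illustration, the sketch below computes both for a single emotion class as the spread (max minus min) of a per-group rate. This is an assumption about the general shape of the metrics: how the paper aggregates gaps over classes is not stated here, and the function names and toy data are hypothetical.

```python
def tpr_gap(y_true, y_pred, groups, positive):
    """Spread of the true positive rate for class `positive` across groups."""
    rates = []
    for g in sorted(set(groups)):
        # Samples of this group whose true label is the class of interest.
        idx = [i for i, (t, gg) in enumerate(zip(y_true, groups))
               if gg == g and t == positive]
        if idx:  # skip groups with no positive samples
            rates.append(sum(y_pred[i] == positive for i in idx) / len(idx))
    return max(rates) - min(rates) if rates else 0.0

def dp_gap(y_pred, groups, positive):
    """Spread of the rate at which class `positive` is predicted across groups."""
    rates = []
    for g in sorted(set(groups)):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        rates.append(sum(y_pred[i] == positive for i in idx) / len(idx))
    return max(rates) - min(rates)

# Toy example with two speaker groups and two emotion classes.
y_true = ["ang", "ang", "hap", "ang", "ang", "hap"]
y_pred = ["ang", "ang", "hap", "ang", "hap", "hap"]
groups = ["F", "F", "F", "M", "M", "M"]

gap_tpr = tpr_gap(y_true, y_pred, groups, positive="ang")
gap_dp = dp_gap(y_pred, groups, positive="ang")
```

A gap of 0 means every group is treated identically on that measure; larger values indicate a bigger disparity, which is why lower gaps correspond to the fairer region of the plots.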