CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition
Yun-Shao Tsai, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
TL;DR
CO-VADA addresses speaker-related bias in speech emotion recognition without requiring demographic annotations or model modifications. It uses an early-stopped classifier to identify bias-guiding versus bias-contrary samples based on the losses $\mathcal{L}_{\mathrm{CE}}$ and $\mathcal{L}_{\mathrm{GCE}}$, and augments data by voice-converting bias-guiding samples to adopt underrepresented speaker traits while preserving emotion. The approach is model- and VC-tool-agnostic and demonstrates improved fairness (lower $\mathrm{TPR}_{\text{gap}}$ and $\mathrm{DP}_{\text{gap}}$) with competitive Macro-F1 across CREMA-D, MSP-Podcast, and MSP-IMPROV, validated through extensive ablations. This makes CO-VADA a scalable solution for fair SER in real-world settings where demographic metadata is unavailable or unreliable.
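The two losses named above can be illustrated with a minimal sketch. It uses the standard definitions $\mathcal{L}_{\mathrm{CE}} = -\log p_y$ and $\mathcal{L}_{\mathrm{GCE}} = (1 - p_y^{q})/q$, where $p_y$ is the predicted probability of the true class. Everything else is an assumption made for illustration and is not taken from the paper: the value `q=0.7`, the helper name `split_by_loss`, and the quantile-threshold rule that treats low-loss samples as bias-guiding and high-loss samples as bias-contrary.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Per-sample cross-entropy: -log p_y."""
    p_true = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(p_true, 1e-12, 1.0))


def generalized_cross_entropy(probs, labels, q=0.7):
    """Per-sample generalized cross-entropy: (1 - p_y^q) / q.

    q=0.7 is an illustrative default, not a value from the paper.
    """
    p_true = probs[np.arange(len(labels)), labels]
    return (1.0 - p_true ** q) / q


def split_by_loss(losses, quantile=0.5):
    """Hypothetical split rule (an assumption, not the paper's criterion).

    Samples that the early-stopped classifier fits easily (low loss) are
    treated as bias-guiding; the remaining samples as bias-contrary.
    """
    threshold = np.quantile(losses, quantile)
    bias_guiding = np.flatnonzero(losses <= threshold)
    bias_contrary = np.flatnonzero(losses > threshold)
    return bias_guiding, bias_contrary


# Toy softmax outputs of an early-stopped classifier (made-up numbers).
probs = np.array([
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.45, 0.55],
])
labels = np.array([0, 0, 0, 1])

ce = cross_entropy(probs, labels)
gce = generalized_cross_entropy(probs, labels)
guiding, contrary = split_by_loss(ce)
```

In this toy setup, the bias-guiding indices would be the candidates for voice conversion. GCE is bounded above by $1/q$, so it grows more slowly than CE on hard samples; how the two losses are actually combined to form the confidence criterion is not specified here.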
Abstract
Bias in speech emotion recognition (SER) systems often stems from spurious correlations between speaker characteristics and emotional labels, leading to unfair predictions across demographic groups. Many existing debiasing methods require model-specific changes or demographic annotations, limiting their practical use. We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying the model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter emotion-irrelevant speaker attributes and generate augmented samples. These augmented samples introduce speaker variations that differ from the dominant patterns in the data, guiding the model to focus more on emotion-relevant features. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
