ERM-MinMaxGAP: Benchmarking and Mitigating Gender Bias in Multilingual Multimodal Speech-LLM Emotion Recognition

Zi Haur Pang; Xiaoxue Gao; Tatsuya Kawahara; Nancy F. Chen

Abstract

Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities remains unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find that bias is strongly language-dependent and that multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with an adaptive fairness-weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Built on the Qwen2-Audio backbone, ERM-MinMaxGAP improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.

Paper Structure (17 sections, 6 equations, 1 figure, 3 tables)

Introduction
Methodology
ERM-Based Supervised Fine-Tuning
MinMaxGAP Regularization
Adaptive Fairness Weight
Final Training Objective
Experimental Setup
Database
Training Hyperparameters
Comparison Models
Evaluation Metrics
Results and Analysis
Benchmarking Gender Bias in Multilingual Multimodal Speech LLMs
Effectiveness of ERM-MinMaxGAP
Ablation Study
...and 2 more sections

Figures (1)

Figure 1: Architecture of the proposed method. The method consists of (1) empirical risk minimization for overall SER improvement, (2) MinMaxGAP for minimizing the language-wise gender gap, and (3) adaptive fairness-weight adjustment for fairness-aware SER.
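
The combination of components (1) and (2) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `minmaxgap_objective`, the `"male"`/`"female"` labels, and the use of an absolute difference of group-mean losses are assumptions made for illustration, and the adaptive fairness weight of component (3) is replaced here by a fixed, hypothetical constant `fairness_weight` because its update rule is not given in this excerpt.

```python
from collections import defaultdict
from statistics import mean


def minmaxgap_objective(losses, languages, genders, fairness_weight=1.0):
    """Return (total, erm, gap) for a batch of per-sample losses.

    Assumed form (not confirmed by the source):
      erm   = mean loss over all samples
      gap   = largest |mean male loss - mean female loss| over languages
      total = erm + fairness_weight * gap
    """
    erm = mean(losses)

    # Collect per-sample losses by (language, gender) group.
    groups = defaultdict(list)
    for loss, lang, gender in zip(losses, languages, genders):
        groups[(lang, gender)].append(loss)

    # Male-female gap per language; skip languages missing either group.
    gaps = []
    for lang in set(languages):
        male = groups.get((lang, "male"))
        female = groups.get((lang, "female"))
        if male and female:
            gaps.append(abs(mean(male) - mean(female)))

    # Penalize only the worst (maximum) gap; zero if no gap is computable.
    gap = max(gaps) if gaps else 0.0
    return erm + fairness_weight * gap, erm, gap


# Toy batch: the English gap is 2.0 and the Japanese gap is 0.5,
# so only the English gap contributes to the penalty.
total, erm, gap = minmaxgap_objective(
    losses=[1.0, 3.0, 2.0, 2.5],
    languages=["en", "en", "ja", "ja"],
    genders=["male", "female", "male", "female"],
    fairness_weight=0.5,
)
```

Under these assumptions, taking the maximum rather than the average gap focuses the penalty on the language where the disparity is largest. In actual training the same computation would run on differentiable tensors, separately for each modality setting.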
