Multimodal Gender Fairness in Depression Prediction: Insights on Data from the USA & China
Joseph Cameron, Jiaee Cheong, Micol Spitale, Hatice Gunes
TL;DR
This work tackles bias and fairness in ML-driven depression detection by comparing multimodal models trained on two culturally distinct datasets: CMDC from China and E-DAIC from the USA. It evaluates acoustic, visual, and textual modalities with multiple classifiers under stratified cross-validation, reporting both performance metrics (accuracy, F1, AUROC) and fairness measures ($EA_{Gender}$ and $DI_{Gender}$). The findings show significant cross-dataset differences in how depression manifests in features and in model fairness, though it remains unclear whether these arise from cultural differences or divergent data collection methods. The study advocates for consistent, culturally aware data collection protocols to mitigate ML bias in mental-health systems and to support fairer, human-centered wellbeing agents. This work advances understanding of cross-country fairness and highlights practical considerations for deploying multimodal depression detectors in diverse populations.
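As a rough illustration of the two fairness measures named above, the sketch below computes a disparate-impact ratio and an equal-accuracy ratio across gender groups. The exact definitions of $DI_{Gender}$ and $EA_{Gender}$ are not given in this summary, so the ratio forms used here, the function name `gender_fairness`, and the choice of `"male"` as the reference group are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def gender_fairness(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    gender: Sequence[str],
    reference: str = "male",  # assumption: which group is the reference
) -> Dict[str, float]:
    """Return assumed ratio-style fairness measures for binary predictions.

    DI: positive-prediction rate of the non-reference group divided by
        that of the reference group.
    EA: accuracy of the non-reference group divided by that of the
        reference group.
    A value of 1.0 means parity under these assumed definitions.
    """
    ref = [(t, p) for t, p, g in zip(y_true, y_pred, gender) if g == reference]
    other = [(t, p) for t, p, g in zip(y_true, y_pred, gender) if g != reference]
    if not ref or not other:
        raise ValueError("both gender groups need at least one sample")

    def positive_rate(pairs):
        # Share of samples predicted as depressed (label 1).
        return sum(1 for _, p in pairs if p == 1) / len(pairs)

    def accuracy(pairs):
        # Share of samples whose prediction matches the ground truth.
        return sum(1 for t, p in pairs if t == p) / len(pairs)

    def ratio(num, den):
        # Undefined when the reference-group quantity is zero.
        return num / den if den > 0 else float("nan")

    return {
        "DI": ratio(positive_rate(other), positive_rate(ref)),
        "EA": ratio(accuracy(other), accuracy(ref)),
    }


# Toy data, invented purely for illustration (not from CMDC or E-DAIC).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 1, 0, 0, 0]
gender = ["male"] * 4 + ["female"] * 4
result = gender_fairness(y_true, y_pred, gender)
print(result)
```

In this toy case both groups are classified equally accurately, yet the model predicts depression far less often for one group, which shows why the two measures are reported side by side.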
Abstract
Social agents and robots are increasingly being used in wellbeing settings. However, a key challenge is that these agents and robots typically rely on machine learning (ML) algorithms to detect and analyse an individual's mental wellbeing. Bias and fairness in ML algorithms are a growing source of concern. At the same time, existing literature indicates that mental health conditions can manifest differently across genders and cultures. We hypothesise that the representation of features (acoustic, textual, and visual) and their inter-modal relations vary among subjects from different cultures and genders, thus affecting the performance and fairness of various ML models. We present the first evaluation of multimodal gender fairness in depression manifestation through a study on two datasets, one from the USA and one from China. We undertake thorough statistical and ML experimentation and repeat the experiments with several different algorithms to ensure that the results are not algorithm-dependent. Our findings indicate that although there are differences between the two datasets, it is not conclusive whether these are due to differences in depression manifestation, as hypothesised, or to external factors such as differences in data collection methodology. Our findings further motivate a call for a more consistent and culturally aware data collection process in order to address the problem of ML bias in depression detection and to promote the development of fairer agents and robots for wellbeing.
