Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models
Mohammad Mehdi Hosseini, Ali Pourramezan Fard, Mohammad H. Mahoor
TL;DR
This work addresses bias and fairness in facial expression recognition by conducting a holistic analysis of four in-the-wild FER datasets (AffectNet, ExpW, Fer2013, RAF-DB) and eight architectures (CNNs, Vision Transformers, CLIP, GPT-4o-mini, and FER-specialized models). It introduces a unified bias assessment framework using seven metrics to probe dataset bias and four fairness metrics to quantify model bias, including cross-dataset generalization tests such as leave-one-dataset-out. Key findings show that AffectNet and ExpW can generalize across datasets despite imbalances, while high-accuracy models like GPT-4o-mini and ViT also exhibit substantial bias; ResNet and XceptionNet tend to be more robust to bias, and FER-specialized models do not fully mitigate fairness concerns. The results stress the need for fairness-aware data curation and training strategies to ensure equitable FER deployment in real-world settings, with future work exploring non-demographic biases, fairness constraints, and multi-modal approaches.
Abstract
Building AI systems, including Facial Expression Recognition (FER) systems, involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks, yet these issues remain underexplored. This study investigates sources of bias in FER datasets and models. Four common FER datasets (AffectNet, ExpW, Fer2013, and RAF-DB) are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models: three state-of-the-art convolutional neural network (CNN) models (MobileNet, ResNet, and XceptionNet) and three transformer-based models (ViT, CLIP, and GPT-4o-mini). Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. Implementation details are available at https://github.com/MMHosseini/bias_in_FER.
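
The leave-one-dataset-out protocol mentioned in the TL;DR can be illustrated with a minimal sketch. It only enumerates the train/test pairings over the four datasets named above; the function name `leave_one_dataset_out` is hypothetical and is not taken from the authors' repository, and the actual training and evaluation steps are left out.

```python
from typing import Iterator, List, Tuple

# The four in-the-wild FER datasets named in the abstract.
DATASETS = ["AffectNet", "ExpW", "Fer2013", "RAF-DB"]


def leave_one_dataset_out(names: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training datasets, held-out dataset) pairs, one per dataset.

    Hypothetical helper for illustration: each dataset is held out once
    for testing while the remaining datasets form the training pool.
    """
    for held_out in names:
        train = [name for name in names if name != held_out]
        yield train, held_out


splits = list(leave_one_dataset_out(DATASETS))
for train, test in splits:
    print(f"train on {train} -> test on {test}")
```

Each of the four splits would then be used to train a model on the pooled training datasets and score it on the held-out one, so that performance on the held-out dataset reflects cross-dataset generalization rather than in-distribution fit.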
