Gender Stereotyping Impact in Facial Expression Recognition
Iris Dominguez-Catena, Daniel Paternain, Mikel Galar
TL;DR
This paper investigates how stereotypical gender bias embedded in FER datasets affects model performance. It constructs derivative datasets from FER+ with controlled gender proportions in selected emotion labels and measures recall disparities between apparent gender groups using a VGG11 baseline. The study finds recall gaps of up to 29% under extreme bias, suggests a safety range of dataset bias within which the trained model does not appear to become biased, and emphasizes the need for bias auditing and careful data curation in FER. The findings show that a global demographic balance can conceal label-specific biases, underscoring the importance of dataset-level bias mitigation to prevent harm in human–AI interactions.
Abstract
Facial Expression Recognition (FER) uses images of faces to identify the emotional state of users, allowing for a closer interaction between humans and autonomous systems. Unfortunately, because the images inherently carry demographic information, such as the apparent age, gender, and race of the subject, these systems are prone to demographic bias issues. In recent years, machine learning-based models have become the most popular approach to FER. These models require training on large datasets of facial expression images, and their generalization capabilities are strongly related to the characteristics of the dataset. In publicly available FER datasets, overall apparent gender representation is usually close to balanced, but representation within individual labels is not, which embeds social stereotypes into the datasets and creates a potential for harm. Although this type of bias has been overlooked so far, it is important to understand the impact it may have in the context of FER. To do so, we use a popular FER dataset, FER+, to generate derivative datasets with different amounts of stereotypical bias by altering the gender proportions of certain labels. We then measure how the performance of models trained on these datasets differs between the apparent gender groups. We observe a discrepancy in the recognition of certain emotions between genders of up to 29% under the worst bias conditions. Our results also suggest a safety range for stereotypical bias in a dataset that does not appear to produce stereotypical bias in the resulting model. Our findings support the need for a thorough bias analysis of public datasets in problems like FER, where a global balance of demographic representation can still hide other types of bias that harm certain demographic groups.
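As an illustration of the two steps described above, the following minimal Python sketch first alters the gender proportion of a single label and then computes per-group recall for that label. It is not the authors' implementation: the function names `bias_label` and `recall_gap`, the dictionary keys `"label"` and `"gender"`, the `"F"`/`"M"` group codes, and the choice of subsampling (rather than any other resampling scheme) are all assumptions made for this example.

```python
import random
from collections import defaultdict


def bias_label(samples, label, female_fraction, seed=0):
    """Subsample one label so that `female_fraction` of its examples are "F".

    Assumption: each sample is a dict with "label" and "gender" keys, and
    bias is induced by dropping examples only (no oversampling).
    Samples belonging to other labels are returned unchanged.
    """
    rng = random.Random(seed)
    others = [s for s in samples if s["label"] != label]
    women = [s for s in samples if s["label"] == label and s["gender"] == "F"]
    men = [s for s in samples if s["label"] == label and s["gender"] == "M"]

    # Largest subset size reachable at the requested proportion.
    limits = []
    if female_fraction > 0:
        limits.append(len(women) / female_fraction)
    if female_fraction < 1:
        limits.append(len(men) / (1 - female_fraction))
    total = min(limits)

    n_women = min(len(women), round(total * female_fraction))
    n_men = min(len(men), round(total * (1 - female_fraction)))
    return others + rng.sample(women, n_women) + rng.sample(men, n_men)


def recall_gap(y_true, y_pred, genders, label):
    """Return per-group recall for `label` and the largest gap between groups."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, genders):
        if truth == label:
            totals[group] += 1
            hits[group] += int(pred == label)
    recalls = {group: hits[group] / totals[group] for group in totals}
    return recalls, max(recalls.values()) - min(recalls.values())
```

In this sketch, a model would be trained on the output of `bias_label` for several values of `female_fraction`, and `recall_gap` would then be evaluated on a held-out test set to see how the disparity grows with the induced bias.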
