Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models

Mohammad Mehdi Hosseini, Ali Pourramezan Fard, Mohammad H. Mahoor

TL;DR

This work addresses bias and fairness in facial expression recognition by conducting a holistic analysis of four in-the-wild FER datasets (AffectNet, ExpW, Fer2013, RAF-DB) and eight architectures (CNNs, Vision Transformers, CLIP, GPT-4o-mini, and FER-specialized models). It introduces a unified bias assessment framework using seven metrics to probe dataset bias and four fairness metrics to quantify model bias, including cross-dataset generalization tests such as leave-one-dataset-out. Key findings show that AffectNet and ExpW can generalize across datasets despite imbalances, while high-accuracy models like GPT-4o-mini and ViT also exhibit substantial bias; ResNet and XceptionNet tend to be more robust to bias, and FER-specialized models do not fully mitigate fairness concerns. The results stress the need for fairness-aware data curation and training strategies to ensure equitable FER deployment in real-world settings, with future work exploring non-demographic biases, fairness constraints, and multi-modal approaches.

Abstract

Building AI systems, including Facial Expression Recognition (FER), involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks. Issues related to bias and fairness in FER datasets and models remain underexplored. This study investigates bias sources in FER datasets and models. Four common FER datasets--AffectNet, ExpW, Fer2013, and RAF-DB--are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models, including three state-of-the-art convolutional neural network (CNN) models: MobileNet, ResNet, XceptionNet, as well as three transformer-based models: ViT, CLIP, and GPT-4o-mini. Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for developing new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. See our implementation details at https://github.com/MMHosseini/bias_in_FER.

Paper Structure

This paper contains 23 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Our approach focuses on examining equity, including bias and fairness, in FER datasets and models. We leveraged various metrics to assess both datasets and models, where each metric quantifies specific aspects and enables numerical evaluation and comparison.
  • Figure 2: Bias can originate from datasets and models. In datasets, prominent sources of bias include issues in data collection, such as demographic disparities, variations in illumination and lighting conditions, gestures, head poses, and cultural differences in emotional interpretation. In models, key bias sources include architecture design, training parameters, overfitting to specific demographic groups, and the selection of evaluation metrics.
  • Figure 3: The data distribution across the datasets shows several trends: a) Happy and Neutral dominate the datasets, while Fear and Disgust are underrepresented; among all datasets, Fer2013 exhibits the most balanced expression distribution; b) a noticeable bias is observed in the age groups, where [16$\sim$32] and [33$\sim$53] are more frequent, while [0$\sim$15] and [Over 54] have significantly fewer samples; c) across all the datasets, there are more Man samples than Woman samples, and this gender imbalance is most pronounced in the ExpW dataset and least evident in AffectNet; d) regarding race, White is the most represented group, while Indian is the least represented. The data distribution for the Black, Latinx, and Middle-Eastern groups is more even.
  • Figure 4: This 2D correlation matrix illustrates the relationships between different attributes across all the datasets (in percent). The diagram visualizes the data distribution for each attribute value and highlights the contribution of each attribute to the others. The heat map reveals biases toward Neutral and Happy. The age groups [16$\sim$32] and [33$\sim$53] are the most prominent, and Man is the most frequent gender. In terms of race, White is overrepresented, whereas Indian is underrepresented.
  • Figure 5: The 4D correlation matrix between the attributes of all the datasets, with expressions in the rows and the columns divided first by age group, then by gender, and finally by race group. The heat map highlights an uneven data distribution in which a large portion of the attribute combinations are underrepresented, revealing limited and imbalanced diversity in the datasets.
  • ...and 3 more figures
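
The per-attribute distributions described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It assumes each sample is a dictionary of annotated attributes. The function names `attribute_shares` and `imbalance_ratio`, the toy records, and the max-to-min ratio itself are hypothetical choices made for illustration. They are not the paper's seven dataset-bias metrics and are not taken from the authors' repository.

```python
from collections import Counter


def attribute_shares(samples, attribute):
    """Return {value: percent of samples} for one annotated attribute."""
    counts = Counter(sample[attribute] for sample in samples)
    total = sum(counts.values())
    return {value: 100.0 * n / total for value, n in counts.items()}


def imbalance_ratio(shares):
    """Largest share divided by smallest share; 1.0 means perfectly balanced.

    Illustrative measure only (an assumption, not one of the paper's metrics).
    """
    return max(shares.values()) / min(shares.values())


# Toy, made-up records; real datasets would supply these annotations.
samples = [
    {"expression": "Happy", "gender": "Man"},
    {"expression": "Happy", "gender": "Woman"},
    {"expression": "Happy", "gender": "Man"},
    {"expression": "Neutral", "gender": "Man"},
    {"expression": "Fear", "gender": "Woman"},
]

expression_shares = attribute_shares(samples, "expression")
gender_shares = attribute_shares(samples, "gender")

print(expression_shares, imbalance_ratio(expression_shares))
print(gender_shares, imbalance_ratio(gender_shares))
```

A ratio well above 1.0 for an attribute flags the kind of skew the captions describe, such as Happy and Neutral dominating the expression labels. Attribute values that never occur in the data do not appear in the shares at all, so a real analysis would need to enumerate the expected values explicitly.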