A Feature-level Bias Evaluation Framework for Facial Expression Recognition Models

Tangzheng Lian, Oya Celiktutan

TL;DR

This work tackles the challenge of evaluating demographic biases in facial expression recognition (FER) without needing demographic labels on the test set. It introduces a feature-space bias evaluation framework that uses a probe dataset to measure differential associations, together with a plug-in, permutation-based statistical module that tests whether the measured biases are statistically significant. Together, these enable bias analysis across age, gender, and race for seven expressions and multiple architectures on AffectNet. The method aligns more closely with ground-truth biases than approaches relying on pseudo-demographic labels, and it reveals stronger age- and race-related biases, with transformer architectures showing larger biases than CNNs. Overall, the framework provides a practical tool for fairer FER deployments and highlights the ethical implications of using visually perceived demographics in fairness assessments.

Abstract

Recent studies on fairness have shown that Facial Expression Recognition (FER) models exhibit biases toward certain visually perceived demographic groups. However, the limited availability of human-annotated demographic labels in public FER datasets has constrained the scope of such bias analysis. To overcome this limitation, some prior works have resorted to pseudo-demographic labels, which may distort bias evaluation results. As an alternative, we propose a feature-level bias evaluation framework for evaluating demographic biases in FER models in the setting where demographic labels are unavailable in the test set. Extensive experiments demonstrate that our method evaluates demographic biases more effectively than existing approaches that rely on pseudo-demographic labels. Furthermore, we observe that many existing studies do not include statistical testing in their bias evaluations, raising concerns that some reported biases may not be statistically significant but rather due to randomness. To address this issue, we introduce a plug-and-play statistical module to ensure the statistical significance of bias evaluation results. A comprehensive bias analysis based on the proposed module is then conducted across three sensitive attributes (age, gender, and race), seven facial expressions, and multiple network architectures on a large-scale dataset, revealing the prominent demographic biases in FER and providing insights on selecting a fairer network architecture.
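To illustrate the kind of check a permutation-based statistical module performs, here is a minimal sketch of a generic two-sided permutation test on the accuracy gap between two demographic groups. It is not the authors' implementation: the function name `permutation_test`, its parameters, and the choice of accuracy gap as the test statistic are all assumptions made for illustration.

```python
import numpy as np


def permutation_test(correct, group, n_permutations=10_000, seed=0):
    """Two-sided permutation test for an accuracy gap between two groups.

    Hypothetical helper for illustration only.
    correct: 1/0 per sample, whether the model's prediction was right.
    group:   True/False per sample, membership in one of two groups.
    Returns (observed accuracy gap, p-value).
    """
    correct = np.asarray(correct, dtype=float)
    group = np.asarray(group, dtype=bool)
    rng = np.random.default_rng(seed)

    def gap(mask):
        # Accuracy of the masked group minus accuracy of the rest.
        return correct[mask].mean() - correct[~mask].mean()

    observed = gap(group)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling group membership simulates the null hypothesis
        # that group membership is unrelated to correctness.
        shuffled = rng.permutation(group)
        if abs(gap(shuffled)) >= abs(observed):
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Synthetic data (not from the paper): group A is more accurate.
    rng = np.random.default_rng(42)
    group = np.array([True] * 200 + [False] * 200)
    correct = np.where(group, rng.random(400) < 0.85, rng.random(400) < 0.70)
    gap_value, p = permutation_test(correct, group, n_permutations=2000)
    print(f"accuracy gap = {gap_value:.3f}, p-value = {p:.4f}")
```

A gap would be reported as a bias only when the p-value falls below a chosen significance threshold; otherwise the disparity is treated as compatible with random variation.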

Paper Structure

This paper contains 30 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Race (left column) and age (right column) distribution for each facial expression in the RAF-DB dataset [li2017reliable]. African-American faces showing fear and very young (0–3 years) or older (70+) faces expressing fear or anger are severely underrepresented, each with fewer than 20 samples. These categories still need to be further split into train/val/test sets.
  • Figure 2: Illustration of the previous bias evaluation pipeline in FER (green) and our proposed framework (orange), which evaluates biases in the feature space. A statistical module is also introduced, applicable to both the previous pipeline and our framework, to ensure the statistical significance of observed performance disparities and differential associations. The FER model remains frozen throughout the evaluation.
  • Figure 3: t-SNE visualization from our pilot study. An FER model trained solely on expression labels encodes unseen face images into well-defined clusters based on gender, age, and race, demonstrating that these attributes are well preserved in the feature space of FER models.
  • Figure 4: Sensitivity analysis of the threshold $\alpha$ across four bias evaluation methods. The closer the curves align with the ground truth, the better the evaluation method.
  • Figure 5: Experimental results of our proposed methods across multiple network architectures.