From Bias to Balance: Detecting Facial Expression Recognition Biases in Large Multimodal Foundation Models

Kaylee Chhua; Zhoujinyi Wen; Vedant Hathalia; Kevin Zhu; Sean O'Brien

TL;DR

The paper addresses racial biases in facial expression recognition (FER) within Large Multimodal Foundation Models (LMFMs) by benchmarking four models (GPT-4o, PaliGemma, Gemini, CLIP) on three datasets (RADIATE, Tarr, Chicago Face). It uses zero-shot prompts and a linear classifier trained on CLIP embeddings, and applies two-proportion $z$-tests to quantify disparities across races and emotions, revealing statistically significant biases. CLIP-based FER achieves the highest accuracies (95.9% on RADIATE, 90.3% on Tarr, 99.5% on Chicago Face) but exhibits systematic biases, notably a higher misclassification rate for Black Females and difficulty recognizing Fear and Sadness. The study highlights the need for bias-aware FER and provides a benchmark and analytic framework to guide fairer multimodal emotion recognition in practice.
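The two-proportion $z$-test mentioned above can be sketched as follows. This is a minimal illustration using the textbook pooled-variance form with a two-sided p-value; the source does not state which variant the authors used, and the function name `two_proportion_z_test` and the counts in the usage lines are hypothetical, not values from the paper.

```python
import math


def two_proportion_z_test(errors_a, n_a, errors_b, n_b):
    """Two-sided two-proportion z-test using the pooled standard error.

    errors_a / n_a and errors_b / n_b are the misclassification
    proportions of two demographic groups (an assumed setup).
    Returns (z, p_value).
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = errors_a / n_a
    p_b = errors_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero (all outcomes identical)")
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts for illustration only (not taken from the paper):
# 30 errors out of 200 images for group A, 15 out of 200 for group B.
z, p = two_proportion_z_test(30, 200, 15, 200)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A small p-value here would indicate that the gap between the two groups' error rates is unlikely under the assumption of equal rates; it does not by itself say how large or how practically important the gap is.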

Abstract

This study addresses racial biases in facial expression recognition (FER) systems within Large Multimodal Foundation Models (LMFMs). Despite advances in deep learning and the availability of diverse datasets, FER systems often exhibit higher error rates for individuals with darker skin tones. Existing research predominantly focuses on traditional FER models (CNNs, RNNs, ViTs), leaving a gap in understanding racial biases in LMFMs. We benchmark four leading LMFMs (GPT-4o, PaliGemma, Gemini, and CLIP) to assess their performance in facial emotion detection across racial demographics. A linear classifier trained on CLIP embeddings obtains accuracies of 95.9% on RADIATE, 90.3% on Tarr, and 99.5% on Chicago Face. Furthermore, we find that Anger is misclassified as Disgust 2.1 times more often for Black Females than for White Females. This study highlights the need for fairer FER systems and establishes a foundation for developing unbiased, accurate FER technologies. Visit https://kvjvhub.github.io/FERRacialBias/ for further information regarding the biases within facial expression recognition.
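A statement such as "misclassified 2.1 times more often" can be read as a ratio of per-group confusion rates. The sketch below shows one plausible way to compute such a ratio from confusion counts; the exact definition used in the paper is not given here, so this reading is an assumption. The helper names `confusion_rate` and `rate_ratio` and all counts are hypothetical and do not reproduce the paper's data or its 2.1 figure.

```python
def confusion_rate(confusions, group, true_label, predicted_label):
    """Fraction of `true_label` images in `group` predicted as `predicted_label`.

    `confusions[group][true_label]` maps each predicted label to a count.
    """
    row = confusions[group][true_label]
    total = sum(row.values())
    if total == 0:
        raise ValueError(f"no {true_label!r} images for group {group!r}")
    return row.get(predicted_label, 0) / total


def rate_ratio(confusions, group_a, group_b, true_label, predicted_label):
    """Ratio of group_a's confusion rate to group_b's for one label pair."""
    rate_a = confusion_rate(confusions, group_a, true_label, predicted_label)
    rate_b = confusion_rate(confusions, group_b, true_label, predicted_label)
    if rate_b == 0:
        raise ValueError("reference group has a zero confusion rate")
    return rate_a / rate_b


# Made-up counts for illustration only (not the paper's results).
example_counts = {
    "Black Female": {"Anger": {"Anger": 80, "Disgust": 20}},
    "White Female": {"Anger": {"Anger": 90, "Disgust": 10}},
}
ratio = rate_ratio(example_counts, "Black Female", "White Female",
                   "Anger", "Disgust")
print(f"Anger -> Disgust rate ratio: {ratio:.2f}")
```

Normalizing by the number of Anger images per group, rather than comparing raw counts, keeps the ratio meaningful when the groups are represented unequally in a dataset.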

Paper Structure (6 sections, 2 figures, 2 tables)

This paper contains 6 sections, 2 figures, 2 tables.

Introduction
Related Works
Dataset Modifications
Benchmarking Large Multimodal Foundation Models
Results and Discussion
Conclusion and Outlook

Figures (2)

Figure 1: RADIATE (Left), Tarr (Middle), and Chicago Face (Right) Dataset Images
Figure 2: RADIATE and Tarr Race and Emotion Distribution
