Unveiling the Human-like Similarities of Automatic Facial Expression Recognition: An Empirical Exploration through Explainable AI

F. Xavier Gaya-Morey, Silvia Ramis-Guarinos, Cristina Manresa-Yee, Jose M. Buades-Rubio

TL;DR

A comparison of twelve networks, including both general object classifiers and FER-specific models, suggests limited alignment between human and AI facial expression recognition. Network architecture influences the similarity, as similar architectures prioritize similar facial regions.

Abstract

Facial expression recognition (FER) is vital for human behavior analysis, and deep learning has enabled models that can outperform humans. However, it is unclear how closely these models mimic human processing. This study explores the similarity between deep neural networks and human perception by comparing twelve different networks, including both general object classifiers and FER-specific models. We employ an innovative global explainable AI method to generate heatmaps that reveal the facial regions most relevant to each of the twelve networks, all trained on six facial expressions. We assess these results both quantitatively and qualitatively, comparing the heatmaps to ground truth masks based on Friesen and Ekman's description and to one another. We use Intersection over Union (IoU) and normalized correlation coefficients for these comparisons. We generate 72 heatmaps, one per expression and architecture, to highlight the critical regions. Qualitatively, models with pre-trained weights produce more similar heatmaps than models without pre-training. Specifically, the eye and nose areas influence certain facial expressions, while the mouth is consistently important across all models and expressions. Quantitatively, we find low IoU values (average 0.2702) across all expressions and architectures. The best-performing architecture averages 0.3269, while the worst-performing one averages 0.2066. Dendrograms built with the normalized correlation coefficient reveal two main clusters for most expressions: models with pre-training and models without pre-training. The findings suggest limited alignment between human and AI facial expression recognition, with network architecture influencing the similarity, as similar architectures prioritize similar facial regions.
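
The two comparison measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the binarization threshold of 0.5, the function names, and the toy values are assumptions made for illustration, and the normalized correlation coefficient is written in its common zero-mean (Pearson-style) form, which may differ from the exact definition used in the paper.

```python
from math import sqrt


def binarize(heatmap, threshold):
    """Turn a relevance heatmap into a boolean mask (threshold is an assumed parameter)."""
    return [[value >= threshold for value in row] for row in heatmap]


def iou(mask_a, mask_b):
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += bool(a) and bool(b)
            union += bool(a) or bool(b)
    # Two empty masks have no union; report 0.0 rather than dividing by zero.
    return intersection / union if union else 0.0


def normalized_correlation(map_a, map_b):
    """Zero-mean normalized correlation between two equally sized heatmaps (assumed form)."""
    xs = [value for row in map_a for value in row]
    ys = [value for row in map_b for value in row]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    # A constant map has zero variance; report 0.0 rather than dividing by zero.
    return numerator / denominator if denominator else 0.0


# Toy 2x3 example: a model heatmap and a hypothetical ground truth mask.
heatmap = [[0.9, 0.8, 0.1],
           [0.2, 0.7, 0.0]]
mask = [[1, 1, 0],
        [0, 0, 1]]

print(iou(binarize(heatmap, 0.5), mask))
print(normalized_correlation(heatmap, heatmap))
```

In this sketch, IoU compares a heatmap against a binary mask, while the correlation compares two heatmaps directly, which mirrors the abstract's split between comparing against ground truth and comparing architectures among themselves.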

Paper Structure (31 sections, 7 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Samples for each class (by columns) available from each of the datasets (by rows) used in this study.
  • Figure 2: (a) Standard image with landmarks. (b) Standard image with landmarks and triangulation.
  • Figure 3: Images from the different steps involved in the explanation of an image. By rows, an example of each class: Anger, Disgust, Fear, Happiness, Sadness and Surprise. By columns: a) image being explained, b) detected face landmarks, c) superpixels computed using SLIC segmentation, d) LIME explanation, e) transformed input image using the normalized landmarks coordinates, and f) transformed LIME relevance for each region in gray scale (for further heatmap computation), using the same landmarks.
  • Figure 4: Face areas involved in each expression, following Friesen and Ekman's (1983) description.
  • Figure 5: Mean accuracy of the cross validation for each trained network.
  • ...and 5 more figures