Evaluating Facial Expression Recognition Datasets for Deep Learning: A Benchmark Study with Novel Similarity Metrics

F. Xavier Gaya-Morey; Cristina Manresa-Yee; Célia Martinie; Jose M. Buades-Rubio

TL;DR

This work addresses how the quality, diversity, and annotation of Facial Expression Recognition (FER) datasets affect deep learning performance. It compiles 24 FER datasets, applies a normalization pipeline, and adds automatic age and gender annotations. It also introduces three similarity metrics (Local Similarity, Global Similarity, and Paired Similarity) to quantify dataset difficulty, generalization, and cross-dataset transferability, and benchmarks the datasets in a cross-dataset setting with Swin Transformer and ConvNeXt. The study finds that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) tend to generalize better to unseen data despite labeling noise and demographic biases, while manually curated datasets offer higher annotation quality but less variability. The authors provide practical recommendations for dataset selection and construction, emphasizing dataset size, demographic diversity, modality (images vs. videos), and the use of similarity-based analyses to inform multi-dataset training strategies. Data and code are publicly released to support reproducibility.

Abstract

This study investigates the key characteristics and suitability of widely used Facial Expression Recognition (FER) datasets for training deep learning models. In the field of affective computing, FER is essential for interpreting human emotions, yet the performance of FER systems is highly contingent on the quality and diversity of the underlying datasets. To address this issue, we compiled and analyzed 24 FER datasets, including those targeting specific age groups such as children, adults, and the elderly, and processed them through a comprehensive normalization pipeline. In addition, we enriched the datasets with automatic annotations for age and gender, enabling a more nuanced evaluation of their demographic properties. To further assess dataset efficacy, we introduce three novel metrics, Local, Global, and Paired Similarity, which quantitatively measure dataset difficulty, generalization capability, and cross-dataset transferability. Benchmark experiments using state-of-the-art neural networks reveal that large-scale, automatically collected datasets (e.g., AffectNet, FER2013) tend to generalize better, despite issues with labeling noise and demographic biases, whereas controlled datasets offer higher annotation quality but limited variability. Our findings provide actionable recommendations for dataset selection and design, advancing the development of more robust, fair, and effective FER systems.
