A Survey on Facial Expression Recognition of Static and Dynamic Emotions

Yan Wang, Shaoqi Yan, Yang Liu, Wei Song, Jing Liu, Yang Chang, Xinji Mai, Xiping Hu, Wenqiang Zhang, Zhongxue Gan

TL;DR

This survey comprehensively maps the landscape of facial expression recognition by separately analyzing static FER (SFER) and dynamic FER (DFER), detailing datasets, workflows, and eight SFER vs seven DFER challenges. It surveys a wide spectrum of model families (CNNs, GCNs, Transformers) and strategies (disturbance-invariance, uncertainty handling, cross-domain/adaptation, weak supervision, and cross-modal fusion) across both image and video modalities, including 3D FER and multimodal approaches with visual-language alignment. The paper also analyzes recent advances on in-the-lab and in-the-wild benchmarks, discusses applications in health, education, and HCI, and highlights ethical concerns, bias, and privacy issues. Finally, it outlines development trends such as zero-shot FER, embodied FER, and multimodal large-language-model–assisted approaches, offering future directions and a public project page for resources and code.

Abstract

Facial expression recognition (FER) aims to analyze emotional states from static images and dynamic sequences, which is pivotal in enhancing anthropomorphic communication among humans, robots, and digital avatars by leveraging AI technologies. As the FER field evolves from controlled laboratory environments to more complex in-the-wild scenarios, advanced methods have been rapidly developed, and new challenges and approaches have emerged that are not well addressed in existing reviews of FER. This paper offers a comprehensive survey of both image-based static FER (SFER) and video-based dynamic FER (DFER) methods, analyzing them from model-oriented development to challenge-focused categorization. We begin with a critical comparison of recent reviews, an introduction to common datasets and evaluation criteria, and an in-depth workflow on FER to establish a robust research foundation. We then systematically review representative approaches addressing eight main challenges in SFER (such as expression disturbance, uncertainties, compound emotions, and cross-domain inconsistency) as well as seven main challenges in DFER (such as key frame sampling, expression intensity variations, and cross-modal alignment). Additionally, we analyze recent advancements, benchmark performances, major applications, and ethical considerations. Finally, we propose five promising future directions and development trends to guide ongoing research. The project page for this paper can be found at https://github.com/wangyanckxx/SurveyFER.

Paper Structure (60 sections, 17 figures, 7 tables)

This paper contains 60 sections, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Taxonomy of FER of static and dynamic emotions. We present a hierarchical taxonomy that categorizes existing FER models by input type, task challenges, and network structures within a systematic framework, aiming to provide a comprehensive overview of the current FER research landscape. First, we introduce datasets, metrics, and workflow (including literature and code) and collect them in a public GitHub repository (Sec. \ref{sec:introduction}, \ref{sec:MULTI_SCENE}, and \ref{sec:Tutorial}). Then, image-based SFER (Sec. \ref{sec:Static}) and video-based DFER (Sec. \ref{sec:Dynamic}) overcome different task challenges using various learning strategies and model designs. Next, we analyze recent advances of FER on benchmark datasets (Sec. \ref{sec:Discussion}). Finally, we discuss important issues and potential trends in FER, highlighting directions for future developments (Sec. \ref{sec:Applications}, \ref{sec:Development}, and \ref{sec:Conclusion}).
  • Figure 2: Publication (bar) and citation (line) statistics on the topic of (a) image-based SFER and (b) video-based DFER from 2016 to 2024.
  • Figure 3: Image-based static facial frames (Above): (a) JAFFE lyons1998coding, (b) CK+ lucey2010extended, (c) SFEW dhall2011static, (d) ExpW zhang2018facial, (e) RAF-DB li2017reliable, (f) AffectNet mollahosseini2017affectnet, (g) EmotioNet fabian2016emotionet, (h) 4DFAB cheng20184dfab; and video-based dynamic facial sequences (Below): (a) CK+ lucey2010extended, (b) Oulu-CASIA zhao2011facial, (c) DFEW jiang2020dfew, (d) FERV39k wang2022ferv39k, and (e) MAFW liu2022mafw of seven basic emotions in the lab and wild.
  • Figure 4: The workflow and main components of generic facial expression recognition.
  • Figure 5: The architecture of general SFER. The figure is reproduced from (a) the CNN-based model zhao2021learning, (b) the GCN-based model Liu10173748, and (c) the Transformer-based model Chen_10350905.
  • ...and 12 more figures