Robust Dynamic Facial Expression Recognition

Feng Liu, Hanyang Wang, Siyuan Shen

TL;DR

This work tackles dynamic facial expression recognition when hard and noisy samples coexist in the training data. It introduces RDFER, which combines an agreement-based loss reweighting strategy, a Key Expression Re-sampling Framework, and a Dual-Stream Hierarchical Network that disentangles short-term facial movements from long-term emotional changes. Experiments on DFEW and FERV39K show state-of-the-art performance, and ablations examine the contribution of each component and the scalability of the method. The approach supports robust learning under noisy labels and shows how prediction agreement can serve as a diagnostic for video-based emotion recognition.

Abstract

Dynamic Facial Expression Recognition (DFER) is a nascent field of research concerned with the automated recognition of facial expressions in video data. Although existing research has primarily focused on learning representations under noisy and hard samples, the coexistence of both types of samples remains unresolved. To overcome this challenge, this paper proposes a robust method of distinguishing between hard and noisy samples by evaluating the agreement of the model's predictions on different sampled clips of the same video. Once the two types are separated, the learning of hard samples can be reinforced and the impact of noisy samples mitigated. Moreover, to identify the principal expression in a video and enhance the model's capacity for representation learning, a method comprising a key expression re-sampling framework and a dual-stream hierarchical network is proposed, namely Robust Dynamic Facial Expression Recognition (RDFER). The key expression re-sampling framework is designed to identify the key expression, thereby mitigating the potential confusion caused by non-target expressions. RDFER employs two sequence models with the objective of disentangling short-term facial movements and long-term emotional changes. Extensive experiments on benchmark datasets such as DFEW and FERV39K show that the proposed method outperforms current state-of-the-art approaches in DFER. A comprehensive analysis provides insights and observations regarding the proposed agreement measure. This work promotes the further development of noise-consistent robust learning in dynamic facial expression recognition. The code is available from https://github.com/Cross-Innovation-Lab/RDFER.
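As a rough illustration of the prediction agreement described in the abstract, the sketch below scores a video by the fraction of its sampled clips that vote for the most common predicted class. This definition is an assumption made for illustration, not necessarily the paper's exact formula, and the function name `clip_agreement` is hypothetical. With four clips it yields the values 0.5, 0.75, and 1.0.

```python
from collections import Counter
from typing import Sequence


def clip_agreement(clip_predictions: Sequence[int]) -> float:
    """Fraction of clips that vote for the most common predicted class.

    Assumed definition for illustration; the paper may define agreement
    differently.
    """
    if not clip_predictions:
        raise ValueError("need at least one clip prediction")
    # most_common(1) returns [(label, count)] for the majority label.
    _, majority_count = Counter(clip_predictions).most_common(1)[0]
    return majority_count / len(clip_predictions)


# Predicted class indices for four clips sampled from one video.
print(clip_agreement([3, 3, 3, 3]))  # 1.0
print(clip_agreement([3, 3, 1, 3]))  # 0.75
print(clip_agreement([3, 1, 3, 1]))  # 0.5
```

A low score means the clips disagree about the expression, while a score of 1.0 means every clip received the same prediction.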

Paper Structure

This paper contains 17 sections, 5 equations, 5 figures, 7 tables, and 1 algorithm.

Figures (5)

  • Figure 1: The proposed method for discriminating between hard and noisy samples segments the video into clips. If the clips are classified into distinct categories (circles with different colors), the video is deemed hard, and the model's learning on it is reinforced. Conversely, if the clips are classified into the same category (circles with the same color) but incur a significant loss, the video is regarded as noisy, and the model is prevented from learning from it.
  • Figure 2: An overview of the Key Expression Re-sampling Framework. (a) The Key Expression Detecting Network. The input video is sampled uniformly and fed into a tiny backbone network to quickly obtain a global summary and predict the key expression. (b) The Dual-Stream Hierarchical Network. Given the key expression predicted by (a), this network learns the representation by disentangling short-term facial movements and long-term emotion changes with a dual-stream hierarchical design.
  • Figure 3: Sample visualization with different agreements.
  • Figure 4: 2D t-SNE visualization of dynamic facial expression features obtained with different agreements and expressions in Table \ref{tab:ablat_agree_acc}. (a) Agreement = 0.50. (b) Agreement = 0.75. (c) Agreement = 1.00.
  • Figure 5: The confusion matrix of our proposed method evaluated on DFEW Fold 1-5.
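
The decision rule in the Figure 1 caption can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `triage_sample` and `loss_weight`, the two thresholds, and the weight values are all hypothetical placeholders, not values from the paper. Only the qualitative rule comes from the caption: disagreeing clips mark a hard video, while agreeing clips with a large loss mark a noisy one.

```python
def triage_sample(agreement: float, loss: float,
                  agreement_threshold: float = 1.0,
                  loss_threshold: float = 1.0) -> str:
    """Label a video 'hard', 'noisy', or 'regular'.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    if agreement < agreement_threshold:
        return "hard"    # clips disagree: reinforce learning
    if loss > loss_threshold:
        return "noisy"   # clips agree but the loss is large: suppress learning
    return "regular"     # clips agree and the loss is small


def loss_weight(label: str) -> float:
    """Map a triage label to a loss weight (illustrative values only)."""
    weights = {"hard": 2.0, "noisy": 0.0, "regular": 1.0}
    return weights[label]


# Example: clips disagree, so the video is treated as hard and up-weighted.
label = triage_sample(agreement=0.5, loss=0.3)
print(label, loss_weight(label))  # hard 2.0
```

Setting the noisy weight to zero mirrors the caption's statement that the model is prevented from learning from such videos. A softer down-weighting would be an equally plausible reading.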