Smile on the Face, Sadness in the Eyes: Bridging the Emotion Gap with a Multimodal Dataset of Eye and Facial Behaviors

Kejun Liu, Yuanyuan Liu, Lin Wei, Chang Tang, Yibing Zhan, Zijing Chen, Zhe Chen

TL;DR

This work tackles the gap between emotion recognition (ER) and facial expression recognition (FER) by augmenting facial expressions with eye-behavior cues in the EMER dataset, enabling joint analysis of ER and FER. It introduces EMERT, a modality-adversarial, multitask Transformer that fuses eye movements, eye fixations, and facial expressions to improve both ER and FER. Experiments across seven benchmark protocols show that EMERT consistently outperforms state-of-the-art multimodal methods and is robust to noise, and they offer deeper insight into how eye behaviors complement facial cues. The EMER dataset and EMERT models provide a comprehensive platform for studying the emotion gap and advancing robust multimodal emotion understanding in real-world settings.

Abstract

Emotion Recognition (ER) is the process of analyzing and identifying human emotions from sensing data. Currently, the field relies heavily on facial expression recognition (FER) because the visual channel conveys rich emotional cues. However, facial expressions are often used as social tools rather than manifestations of genuine inner emotions. To understand and bridge this gap between FER and ER, we introduce eye behaviors as an important emotional cue and construct an Eye-behavior-aided Multimodal Emotion Recognition (EMER) dataset. To collect data with genuine emotions, a spontaneous emotion induction paradigm is used with stimulus material, during which non-invasive eye behavior data, such as eye movement sequences and eye fixation maps, are captured together with facial expression videos. To better illustrate the gap between ER and FER, multi-view emotion labels for multimodal ER and FER are annotated separately. Furthermore, based on the new dataset, we design a simple yet effective Eye-behavior-aided MER Transformer (EMERT) that enhances ER by bridging the emotion gap. EMERT leverages modality-adversarial feature decoupling and a multitask Transformer to model eye behaviors as a strong complement to facial expressions. In the experiments, we introduce seven multimodal benchmark protocols for comprehensive evaluation of the EMER dataset. The results show that EMERT outperforms other state-of-the-art multimodal methods by a large margin, revealing the importance of modeling eye behaviors for robust ER. To sum up, we provide a comprehensive analysis of the importance of eye behaviors in ER, advancing the study of the gap between FER and ER for more robust ER performance. Our EMER dataset and the trained EMERT models will be publicly available at https://github.com/kejun1/EMER.

Paper Structure

This paper contains 41 sections, 4 equations, 9 figures, 12 tables.

Figures (9)

  • Figure 1: An example from our EMER dataset. EMER comprises facial expression videos, eye movement sequences, and eye fixation maps, along with multi-view emotion annotations, including FER labels and ER labels, providing more comprehensive emotion analysis.
  • Figure 2: The collection framework for our EMER dataset. EMER is a multimodal, participant-rich emotion dataset with multi-view annotations, providing a novel research direction for understanding the emotion gap between ER and FER.
  • Figure 3: Some examples of stimulus materials.
  • Figure 4: The SAM self-assessment for the ER annotation.
  • Figure 5: The ALA pipeline for high-reliability FER annotation, including model auto-annotation, expert annotation, and annotation reliability assessment.
  • ...and 4 more figures