Emotion Recognition Using Transformers with Masked Learning

Seongjae Min, Junseok Yang, Sangjun Lim, Junyong Lee, Sangwon Lee, Sejoon Lim

TL;DR

This paper tackles emotion analysis in-the-wild, focusing on Valence-Arousal (VA) estimation, facial expression recognition, and Action Unit (AU) detection using ABAW-style data. It proposes a Transformer-based framework that uses a Vision Transformer-based feature extractor and a Transformer Classifier to model temporal dynamics from temporally ordered, masked frame features, with a loss strategy that addresses data imbalance and temporal generalization. The main contributions are a random frame masking learning technique and the application of Focal loss for imbalance along with CCC loss for VA, enabling improved performance on real-world datasets. The approach potentially advances emotional computing by delivering more robust, temporally aware emotion understanding under diverse conditions.
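To make the CCC loss mentioned above concrete, here is a minimal NumPy sketch of the standard Concordance Correlation Coefficient and the usual `1 - CCC` loss built on it. The function names `ccc` and `ccc_loss`, the `eps` stabilizer, and the choice to compute the statistic over a flat vector of predictions are assumptions made for illustration; this summary does not state how the paper batches or weights the valence and arousal terms.

```python
import numpy as np


def ccc(pred, target, eps=1e-8):
    """Concordance Correlation Coefficient between two 1-D sequences.

    CCC = 2 * cov(pred, target) / (var(pred) + var(target) + (mean gap)^2).
    `eps` is an illustrative stabilizer for constant inputs (assumption).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mean_p, mean_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mean_p) * (target - mean_t)).mean()
    return 2.0 * cov / (var_p + var_t + (mean_p - mean_t) ** 2 + eps)


def ccc_loss(pred, target):
    """Loss that is 0 for perfect agreement and grows as agreement drops."""
    return 1.0 - ccc(pred, target)
```

Unlike plain Pearson correlation, CCC also penalizes a constant offset between predictions and labels, which is why it is a common choice for continuous VA regression.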

Abstract

In recent years, deep learning has achieved innovative advancements in various fields, including the analysis of human emotions and behaviors. Initiatives such as the Affective Behavior Analysis in-the-wild (ABAW) competition have been particularly instrumental in driving research in this area by providing diverse and challenging datasets that enable precise evaluation of complex emotional states. This study leverages the Vision Transformer (ViT) and Transformer models to focus on the estimation of Valence-Arousal (VA), which signifies the positivity and intensity of emotions, recognition of various facial expressions, and detection of Action Units (AU) representing fundamental muscle movements. This approach transcends traditional Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) based methods, proposing a new Transformer-based framework that maximizes the understanding of temporal and spatial features. The core contributions of this research include the introduction of a learning technique through random frame masking and the application of Focal loss adapted for imbalanced data, enhancing the accuracy and applicability of emotion and behavior analysis in real-world settings. This approach is expected to contribute to the advancement of emotional computing and deep learning methodologies.
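As an illustration of the Focal loss named in the abstract, the sketch below implements the standard multi-class form, `-(1 - p_t)^gamma * log(p_t)`, in NumPy. The function name `focal_loss`, the default `gamma=2.0`, and the optional per-class `alpha` weights are assumptions for this example; the summary does not give the hyperparameters or the exact variant the authors used.

```python
import numpy as np


def focal_loss(logits, labels, gamma=2.0, alpha=None):
    """Mean multi-class focal loss over a batch.

    logits: array of shape (batch, num_classes)
    labels: integer class indices of shape (batch,)
    gamma:  focusing parameter; gamma=0 reduces to cross-entropy
    alpha:  optional per-class weights of shape (num_classes,) (assumption)
    """
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.int64)

    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Log-probability and probability assigned to the true class.
    log_pt = log_probs[np.arange(len(labels)), labels]
    pt = np.exp(log_pt)

    # Down-weight well-classified examples by (1 - p_t)^gamma.
    loss = -((1.0 - pt) ** gamma) * log_pt
    if alpha is not None:
        loss = loss * np.asarray(alpha, dtype=np.float64)[labels]
    return loss.mean()
```

The modulating factor shrinks the contribution of examples the model already classifies confidently, so rare expression or AU classes contribute relatively more to the gradient.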

Paper Structure (9 sections, 2 equations, 1 figure, 1 table)

This paper contains 9 sections, 2 equations, 1 figure, 1 table.

Figures (1)

  • Figure 1: The overall pipeline of our model. First, a pretrained Vision Transformer extracts features from each input frame independently (where $b$ is the batch size and $n$ is the sequence length). To reduce the risk of overfitting, the extracted frame features are then randomly masked. Finally, a Transformer classifier processes the sequence of masked frame features to predict the output $\hat{y}$.
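The masking step in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration that assumes frame features of shape `(b, n, d)` and that a masked frame is replaced by a zero vector. The function name `random_frame_mask`, the `mask_prob` parameter, and the zero-fill strategy are all hypothetical; the caption does not say whether the authors zero the features, use a learned mask token, or drop frames.

```python
import numpy as np


def random_frame_mask(features, mask_prob=0.2, rng=None):
    """Randomly mask whole frames in a batch of per-frame features.

    features:  array of shape (b, n, d), with b = batch size,
               n = sequence length, d = feature dimension
    mask_prob: probability of masking each frame (hypothetical default)
    rng:       optional numpy Generator for reproducibility

    Returns (masked_features, keep), where keep has shape (b, n) and is
    True for frames that were left untouched.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features)
    b, n, _ = features.shape

    # One independent keep/mask decision per frame.
    keep = rng.random((b, n)) >= mask_prob

    # Zero out every feature of a masked frame (assumed fill strategy).
    masked = features * keep[:, :, None]
    return masked, keep


# Usage: mask roughly 20% of the frames before the temporal classifier.
frame_features = np.ones((4, 16, 8))
masked_features, keep_mask = random_frame_mask(
    frame_features, mask_prob=0.2, rng=np.random.default_rng(0)
)
```

Masking would normally be applied only during training, so that the classifier learns to predict from incomplete temporal context and sees the full sequence at inference time.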