Table of Contents
Fetching ...

EmoNeXt: an Adapted ConvNeXt for Facial Emotion Recognition

Yassine El Boudouri, Amine Bohi

TL;DR

EmoNeXt tackles facial emotion recognition by adapting ConvNeXt with an integrated Spatial Transformer Network and Squeeze-and-Excitation blocks to improve spatial alignment and channel-wise feature calibration. A self-attention regularization term is added to encourage balanced, compact feature representations, with the final loss $\mathcal{L}_{final} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{SA}$. Trained on FER2013 with strong data augmentation, EMA, mixed precision, and pretrained ConvNeXt weights, EmoNeXt achieves leading performance, notably 76.12% on EmoNeXt-XLarge, exceeding prior state-of-the-art single-model FER approaches. The work demonstrates that combining spatial alignment, channel recalibration, and attention regularization yields robust emotion recognition with practical implications for HCI and clinical research.

Abstract

Facial expressions play a crucial role in human communication serving as a powerful and impactful means to express a wide range of emotions. With advancements in artificial intelligence and computer vision, deep neural networks have emerged as effective tools for facial emotion recognition. In this paper, we propose EmoNeXt, a novel deep learning framework for facial expression recognition based on an adapted ConvNeXt architecture network. We integrate a Spatial Transformer Network (STN) to focus on feature-rich regions of the face and Squeeze-and-Excitation blocks to capture channel-wise dependencies. Moreover, we introduce a self-attention regularization term, encouraging the model to generate compact feature vectors. We demonstrate the superiority of our model over existing state-of-the-art deep learning models on the FER2013 dataset regarding emotion classification accuracy.

EmoNeXt: an Adapted ConvNeXt for Facial Emotion Recognition

TL;DR

EmoNeXt tackles facial emotion recognition by adapting ConvNeXt with an integrated Spatial Transformer Network and Squeeze-and-Excitation blocks to improve spatial alignment and channel-wise feature calibration. A self-attention regularization term is added to encourage balanced, compact feature representations, with the final loss . Trained on FER2013 with strong data augmentation, EMA, mixed precision, and pretrained ConvNeXt weights, EmoNeXt achieves leading performance, notably 76.12% on EmoNeXt-XLarge, exceeding prior state-of-the-art single-model FER approaches. The work demonstrates that combining spatial alignment, channel recalibration, and attention regularization yields robust emotion recognition with practical implications for HCI and clinical research.

Abstract

Facial expressions play a crucial role in human communication serving as a powerful and impactful means to express a wide range of emotions. With advancements in artificial intelligence and computer vision, deep neural networks have emerged as effective tools for facial emotion recognition. In this paper, we propose EmoNeXt, a novel deep learning framework for facial expression recognition based on an adapted ConvNeXt architecture network. We integrate a Spatial Transformer Network (STN) to focus on feature-rich regions of the face and Squeeze-and-Excitation blocks to capture channel-wise dependencies. Moreover, we introduce a self-attention regularization term, encouraging the model to generate compact feature vectors. We demonstrate the superiority of our model over existing state-of-the-art deep learning models on the FER2013 dataset regarding emotion classification accuracy.
Paper Structure (13 sections, 4 equations, 5 figures, 2 tables)

This paper contains 13 sections, 4 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The architecture of a spatial transformer module.
  • Figure 2: The ConvNeXt block.
  • Figure 3: The architecture of the Squeeze-and-Excitation block.
  • Figure 4: Architecture designes for ConvNeXt and EmoNeXt.
  • Figure 5: Sample images from the FER2013 dataset.