A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext

Bingyu Nan, Feng Liu, Xuezhong Qian, Wei Song

TL;DR

Facial expression recognition (FER) is challenged by inter-class similarity and uneven data distribution. The authors propose Conv-cut, a truncated ConvNeXt-Base backbone augmented with a Detail Extraction Block and a self-attention mechanism, which captures fine-grained facial features with fewer parameters. The attention is expressed as $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(QK^{\top}/\sqrt{d_q}\right)V$. The approach achieves state-of-the-art results on RAF-DB and FERPlus, outperforming recent methods; ablations show complementary gains from the attention mechanism and the DET module. This design reduces overfitting on small FER datasets and demonstrates robust performance under real-world conditions, supporting more reliable, real-time FER systems.
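The attention formula quoted in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: it is plain, dependency-free Python on toy inputs, the function names `softmax` and `attention` are hypothetical, and it assumes the standard scaled dot-product form in which each query is compared against every key (that is, $QK^{\top}$) and scaled by the square root of the query dimension.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n x d, K is m x d, V is m x d_v (illustrative shapes, assumed here).
    Returns (output, weights): output is n x d_v, weights is n x m.
    """
    scale = math.sqrt(len(Q[0]))  # sqrt of the query/key dimension
    weights = []
    for q in Q:
        # Dot product of this query with every key, then scale.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value rows.
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


# Toy self-attention: queries, keys, and values all come from the same features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(features, features, features)
for row in w:
    print([round(x, 3) for x in row])
```

Each row of `w` sums to 1, so every output vector is a convex combination of the value vectors. In a network such as the one described, the inputs would be learned projections of the extracted feature map rather than raw toy vectors.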

Abstract

Facial expression recognition (FER) is an important research direction in the field of artificial intelligence. Although new breakthroughs have been made in recent years, the uneven distribution of datasets, the similarity between different categories of facial expressions, and the differences within the same category among different subjects remain challenges. This paper proposes a visual facial expression signal feature processing network based on a truncated ConvNeXt approach (Conv-cut) to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor. We then design a Detail Extraction Block to extract detailed features and introduce a self-attention mechanism that enables the network to learn the extracted features more effectively. To evaluate the proposed Conv-cut approach, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model achieves state-of-the-art performance. Our code is available on GitHub.

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: This article attempts to address the variability between the expressions of different subjects in the field of FER. (a) Uneven distribution of data in the public datasets FERPlus and RAF-DB. (b) Similarities between expression categories. (c) Differences between different contributors in the same category.
  • Figure 2: The truncated ConvNeXt serves as the foundational framework for feature extraction, and the detail extraction module then extracts fine-grained features. The attention mechanism enhances the model's focus on fine-grained feature regions.
  • Figure 3: The confusion matrix of our proposed Conv-cut evaluated on RAF-DB and FERPlus.
  • Figure 4: 2D t-SNE visualization of facial expression features obtained through different models: the baseline model, the truncated ConvNeXt model, and Conv-cut. These features were extracted from the RAF-DB dataset.
  • Figure 5: Attention visualization for images from the RAF-DB dataset. The min label and max label represent the average minimum and average maximum number of categories, respectively.