A Visual Self-attention Mechanism Facial Expression Recognition Network beyond Convnext
Bingyu Nan, Feng Liu, Xuezhong Qian, Wei Song
TL;DR
Facial expression recognition (FER) is challenged by inter-class similarity and uneven data distribution. The authors propose Conv-cut, a truncated ConvNeXt-Base backbone augmented with a Detail Extraction Block and a self-attention mechanism, which captures fine-grained facial features with fewer parameters. The attention is expressed as $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^{T}/\sqrt{d_q})V$. The approach achieves state-of-the-art results on RAF-DB and FERPlus, outperforming recent methods, and ablations show complementary gains from the attention mechanism and the Detail Extraction Block. The design reduces overfitting on small FER datasets and performs robustly under real-world conditions, supporting more reliable real-time FER systems.
Abstract
Facial expression recognition (FER) is an important research direction in artificial intelligence. Despite recent breakthroughs, three challenges remain: the uneven distribution of datasets, the similarity between different categories of facial expressions, and the differences within the same category across subjects. This paper proposes a visual facial expression signal feature processing network based on a truncated ConvNeXt (Conv-cut) to improve the accuracy of FER under challenging conditions. The network uses a truncated ConvNeXt-Base as the feature extractor, a Detail Extraction Block that we designed to extract detailed features, and a self-attention mechanism that enables the network to learn the extracted features more effectively. To evaluate Conv-cut, we conducted experiments on the RAF-DB and FERPlus datasets, and the results show that our model achieves state-of-the-art performance. Our code is available on GitHub.
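
As an illustration of the self-attention step named above, the following is a minimal sketch of generic scaled dot-product attention, $\mathrm{softmax}(QK^{T}/\sqrt{d})V$, in plain Python. It is not the authors' implementation: the function names `softmax` and `scaled_dot_product_attention`, the list-of-lists representation, and the input shapes are assumptions made for illustration only, and how Conv-cut derives $Q$, $K$, and $V$ from the extracted features is not specified here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d)) V for list-of-lists inputs.

    Assumed shapes (illustrative): Q is n_q x d, K is n_k x d, V is n_k x d_v.
    """
    d = len(Q[0])          # shared query/key dimension
    d_v = len(V[0])        # value dimension
    scale = math.sqrt(d)
    output = []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Normalize the scores into attention weights that sum to 1.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        )
    return output
```

Each output row is a convex combination of the rows of `V`, weighted by how strongly the corresponding query matches each key; when all keys are identical, the weights are uniform and the output is the mean of the value vectors.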
