Design of an Expression Recognition Solution Based on the Global Channel-Spatial Attention Mechanism and Proportional Criterion Fusion
Jun Yu, Yang Zheng, Lei Wang, Yongqi Wang, Shengfan Xu
TL;DR
This work addresses facial expression recognition in the wild by integrating audio and visual modalities in a lightweight, end-to-end framework. It introduces a Global Channel-Spatial Attention mechanism that enhances the features of each modality, and a proportional criterion-based decision fusion that combines the audio and image classification results, with a coarse-fine granularity loss guiding optimization. The approach uses a multi-branch CNN (MCNN) for audio and a residual hybrid CNN (RHCNN) for image feature extraction, within a three-stage pipeline with targeted preprocessing. In the expression recognition task of the 8th ABAW Competition (CVPR 2025), the method ranked third on the official validation set, which supports the value of channel-spatial attention and decision-level fusion for robust emotion recognition.
Abstract
Facial expression recognition is a challenging classification task with broad application prospects in human-computer interaction. This paper presents the method we adopt in the 8th Affective and Behavioral Analysis in the Wild (ABAW) Competition, held during the Conference on Computer Vision and Pattern Recognition (CVPR) 2025. First, we preprocess the original videos with frequency masking and by extracting data at equal time intervals. Then, we design feature extraction models for the image and audio sequences, based on a residual hybrid convolutional neural network and a multi-branch convolutional neural network, respectively. In particular, we propose a global channel-spatial attention mechanism to enhance the features initially extracted from each of the two modalities. Finally, we adopt a decision fusion strategy based on the proportional criterion to fuse the classification results of the two single modalities, obtain an emotion probability vector, and output the final emotion classification. We also design a coarse-fine granularity loss function to optimize the entire network, which improves the accuracy of facial expression recognition. In the facial expression recognition task of the 8th ABAW Competition, our method ranked third on the official validation set. This result supports the effectiveness and competitiveness of the proposed method.
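
The two preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of segment centres for sampling, the single zero-filled band, and the `max_width` parameter are all assumptions made for illustration, and the spectrogram is represented as a plain list of rows (one row per frequency bin).

```python
import random


def equal_interval_indices(num_frames, num_samples):
    """Pick num_samples frame indices spaced evenly across a clip.

    Assumption: the clip is split into equal-length segments and the
    centre frame of each segment is taken.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Clip is shorter than the requested sample count: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def frequency_mask(spec, max_width, rng=None, fill=0.0):
    """Return a copy of spec with one random band of frequency bins masked.

    spec is a list of rows, one row per frequency bin, each row holding
    the values over time. The band width is drawn from 0..max_width
    (hypothetical parameter), and masked rows are set to fill.
    """
    rng = rng or random.Random()
    n_bins = len(spec)
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randint(0, n_bins - width)
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


if __name__ == "__main__":
    # Sample 5 frames from a 100-frame clip.
    print(equal_interval_indices(100, 5))
    # Mask up to 3 of 8 frequency bins in a toy spectrogram.
    toy = [[1.0] * 4 for _ in range(8)]
    for row in frequency_mask(toy, 3, rng=random.Random(0)):
        print(row)
```

Sampling at segment centres keeps the selected frames away from the clip boundaries; taking the first frame of each segment would be an equally valid reading of "equal time intervals".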
