ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition
Yan Li, Yong Zhao, Xiaohan Xia, Dongmei Jiang
TL;DR
ARPGNet addresses facial expression recognition by jointly learning appearance cues via CNNs and region-relational cues via a graph attention network. It introduces a facial region relation graph and a parallel graph attention fusion module with temporal position encoding, which lets the appearance and relational representations enhance each other while capturing both intra-sequence dynamics and inter-sequence complementarity. Experiments on RML, AFEW, and Aff-Wild2 show state-of-the-art or competitive performance and highlight the benefits of explicit facial structure modeling and cross-sequence fusion. The work emphasizes robust relational modeling and efficient temporal fusion, offering a practical approach for both controlled and in-the-wild expression recognition, with potential extensions to occlusion handling and domain adaptation.
Abstract
The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.
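To make the idea of attention over a facial region relation graph concrete, the following is a minimal sketch of a generic single-head graph attention layer in plain Python. It is not the authors' implementation: the abstract does not specify the region set, the graph topology, or the layer details, so the region names, the adjacency matrix, the feature sizes, and the function `graph_attention` are all illustrative assumptions. The sketch only shows how each region can aggregate its neighbours' features with learned, normalized weights.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def graph_attention(features, adjacency, weight, attn_vec):
    """Generic single-head graph attention over region nodes (illustrative only).

    features:  N node vectors, each of length F
    adjacency: N x N 0/1 matrix; adjacency[i][j] = 1 if node j is a neighbour
               of node i (every node needs at least one neighbour, e.g. itself)
    weight:    F x F_out projection matrix
    attn_vec:  vector of length 2 * F_out used to score node pairs
    Returns (outputs, alphas): N vectors of length F_out and the N x N
    attention weights.
    """
    n = len(features)
    f_out = len(weight[0])
    # Project every node: z_i = x_i W
    z = [[sum(x[k] * weight[k][d] for k in range(len(x))) for d in range(f_out)]
         for x in features]

    outputs, alphas = [], []
    for i in range(n):
        neighbours = [j for j in range(n) if adjacency[i][j]]
        # Unnormalized score e_ij = LeakyReLU(a . [z_i || z_j])
        scores = [leaky_relu(sum(a * v for a, v in zip(attn_vec, z[i] + z[j])))
                  for j in neighbours]
        # Softmax over the neighbourhood, shifted by the max for stability
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [0.0] * n
        for j, e in zip(neighbours, exps):
            row[j] = e / total
        alphas.append(row)
        # Weighted sum of the neighbours' projected features
        outputs.append([sum(row[j] * z[j][d] for j in neighbours)
                        for d in range(f_out)])
    return outputs, alphas


# Hypothetical four-region graph with self-loops; not taken from the paper.
random.seed(0)
regions = ["left_eye", "right_eye", "nose", "mouth"]
adjacency = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
]
features = [[random.uniform(-1, 1) for _ in range(3)] for _ in regions]
weight = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
attn_vec = [random.uniform(-1, 1) for _ in range(4)]

outputs, alphas = graph_attention(features, adjacency, weight, attn_vec)
```

In this toy setup each row of `alphas` sums to one over the region's neighbours, and non-adjacent regions (here the eyes and the mouth) receive zero weight. In a trained model the random `weight` and `attn_vec` would be learned parameters, and the per-frame outputs would form the relational representation sequence described in the abstract.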
