ARPGNet: Appearance- and Relation-aware Parallel Graph Attention Fusion Network for Facial Expression Recognition

Yan Li, Yong Zhao, Xiaohan Xia, Dongmei Jiang

TL;DR

ARPGNet addresses facial expression recognition by jointly learning appearance cues via CNNs and region-relational cues via a graph attention network. It introduces a facial region relation graph and a parallel graph attention fusion module with temporal position encoding, so that appearance and relational representations enhance each other while the model captures both intra-sequence dynamics and inter-sequence complementarity. Experiments on RML, AFEW, and Aff-Wild2 show performance that matches or exceeds state-of-the-art methods and highlight the benefits of explicit facial structure modeling and cross-sequence fusion. The work emphasizes relational modeling and efficient temporal fusion, and it applies to both controlled and in-the-wild expression recognition, with potential extensions to occlusion handling and domain adaptation.

Abstract

The key to facial expression recognition is to learn discriminative spatial-temporal representations that embed facial expression dynamics. Previous studies predominantly rely on pre-trained Convolutional Neural Networks (CNNs) to learn facial appearance representations, overlooking the relationships between facial regions. To address this issue, this paper presents an Appearance- and Relation-aware Parallel Graph attention fusion Network (ARPGNet) to learn mutually enhanced spatial-temporal representations of appearance and relation information. Specifically, we construct a facial region relation graph and leverage the graph attention mechanism to model the relationships between facial regions. The resulting relational representation sequences, along with CNN-based appearance representation sequences, are then fed into a parallel graph attention fusion module for mutual interaction and enhancement. This module simultaneously explores the complementarity between different representation sequences and the temporal dynamics within each sequence. Experimental results on three facial expression recognition datasets demonstrate that the proposed ARPGNet outperforms or is comparable to state-of-the-art methods.
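To make the graph attention step mentioned in the abstract concrete, here is a minimal sketch of one single-head attention update over facial region nodes. It assumes the standard GAT-style scoring (a LeakyReLU over a learned vector applied to the concatenated projected features, followed by a softmax over each node's neighbors); the exact formulation used in ARPGNet is not given in this excerpt. The names `graph_attention_layer`, `W`, and `a`, and the random weights in the demo, are hypothetical.

```python
import math
import random


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def linear(W, h):
    """Multiply matrix W (out_dim x in_dim) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def graph_attention_layer(features, neighbors, W, a):
    """One single-head GAT-style update (an assumed, generic formulation).

    features:  list of N node feature vectors, one per facial region
    neighbors: neighbors[i] lists the nodes that node i attends to;
               include i itself to keep a self-connection, and give
               every node at least one neighbor
    W:         out_dim x in_dim projection matrix (hypothetical weights)
    a:         attention vector of length 2 * out_dim (hypothetical weights)
    Returns (updated, attention), where attention[i] maps j -> alpha_ij.
    """
    projected = [linear(W, h) for h in features]
    updated, attention = [], []
    for i, nbrs in enumerate(neighbors):
        # Unnormalized score e_ij = LeakyReLU(a^T [W h_i || W h_j]).
        scores = [
            leaky_relu(sum(w * x for w, x in zip(a, projected[i] + projected[j])))
            for j in nbrs
        ]
        # Softmax restricted to the neighborhood of node i.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        # Aggregate the projected neighbor features, weighted by attention.
        out_dim = len(projected[i])
        updated.append([
            sum(alpha * projected[j][d] for alpha, j in zip(alphas, nbrs))
            for d in range(out_dim)
        ])
        attention.append(dict(zip(nbrs, alphas)))
    return updated, attention


if __name__ == "__main__":
    rng = random.Random(0)
    n_nodes, in_dim, out_dim = 4, 3, 2
    feats = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_nodes)]
    W = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    a = [rng.uniform(-1, 1) for _ in range(2 * out_dim)]
    chain = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]  # toy neighborhood lists
    new_feats, attn = graph_attention_layer(feats, chain, W, a)
    print(attn[1])  # attention weights of node 1 over nodes 0, 1, 2
```

Restricting the softmax to each node's neighbors is what lets the graph structure, rather than a dense all-to-all comparison, decide which regions can influence each other.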

Paper Structure

This paper contains 40 sections, 9 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The proposed appearance- and relation-aware parallel graph attention fusion network. A facial image sequence is processed by two parallel modules: a CNN-based appearance representation learning module, and a GNN-based relation representation learning module. The former extracts facial appearance information, while the latter models high-level relationships among different facial regions. After introducing temporal position encoding, the two learned facial embedding sequences are then fed into the parallel graph attention fusion module to simultaneously capture the complementary information between the sequences and model the temporal dynamics within each sequence. The mutually enhanced higher-level representations are concatenated and pooled over time to generate a video-level representation, which is then input into an MLP for facial expression recognition.
  • Figure 2: Facial feature map after adaptive average pooling layer ($P=6$) and corresponding facial region relation sub-graph of node 8. The connection with itself is omitted for simplicity.
  • Figure 3: Some samples of the three facial expression recognition datasets.
  • Figure 4: Comparison of model accuracy and inference time.
  • Figure 5: Visualization of attention scores between the relation representation node at frame 10 (marked as the target node) and its neighbors. It can be seen that inter-sequence complementary information and intra-sequence temporal dynamics can help each other in representation learning.
  • ...and 3 more figures
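The sub-graph described in the Figure 2 caption can be illustrated with a small sketch. It treats the pooled $P \times P$ feature map ($P=6$) as 36 region nodes and links each node to its spatially adjacent regions. Two points are assumptions rather than facts from the caption: nodes are numbered row-major starting at 0, and adjacency means the 8 surrounding cells. The function names are hypothetical.

```python
from __future__ import annotations


def region_neighbors(node: int, grid: int = 6) -> list[int]:
    """Return the spatially adjacent nodes of `node` on a grid x grid map.

    Assumptions (not stated in the caption): row-major numbering from 0
    and 8-connected adjacency. The self-connection is left out here,
    mirroring how the Figure 2 caption omits it.
    """
    if not 0 <= node < grid * grid:
        raise ValueError(f"node must be in [0, {grid * grid - 1}]")
    row, col = divmod(node, grid)
    neighbors = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the node itself
            r, c = row + dr, col + dc
            if 0 <= r < grid and 0 <= c < grid:
                neighbors.append(r * grid + c)
    return neighbors


def build_adjacency(grid: int = 6) -> list[list[int]]:
    """Build a dense 0/1 adjacency matrix, adding self-loops back in."""
    n = grid * grid
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-connection used during attention
        for j in region_neighbors(i, grid):
            adj[i][j] = 1
    return adj


if __name__ == "__main__":
    print(region_neighbors(8))  # -> [1, 2, 3, 7, 9, 13, 14, 15]
    adj = build_adjacency()
    print(len(adj), sum(adj[8]))  # -> 36 9
```

Under these assumptions node 8 is an interior cell, so its sub-graph has eight neighbors plus the omitted self-connection; corner and edge cells have fewer.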