Multi-Scale Spatio-Temporal Graph Convolutional Network for Facial Expression Spotting

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

TL;DR

This paper proposes SpoT-GCN, a Multi-Scale Spatio-Temporal Graph Convolutional Network for facial expression spotting. It learns both local and global features from multiple scales of facial graph structures using the proposed facial local graph pooling (FLGP).

Abstract

Facial expression spotting is a significant but challenging task in facial expression analysis. The accuracy of expression spotting is affected not only by irrelevant facial movements but also by the difficulty of perceiving subtle motions in micro-expressions. In this paper, we propose a Multi-Scale Spatio-Temporal Graph Convolutional Network (SpoT-GCN) for facial expression spotting. To extract more robust motion features, we track both short- and long-term motion of facial muscles in compact sliding windows whose window length adapts to the temporal receptive field of the network. This strategy, termed the receptive field adaptive sliding window strategy, effectively magnifies the motion features while alleviating the problem of severe head movement. The subtle motion features are then converted to a facial graph representation, whose spatio-temporal graph patterns are learned by a graph convolutional network. This network learns both local and global features from multiple scales of facial graph structures using our proposed facial local graph pooling (FLGP). Furthermore, we introduce supervised contrastive learning to enhance the discriminative capability of our model for difficult-to-classify frames. The experimental results on the SAMM-LV and CAS(ME)^2 datasets demonstrate that our method achieves state-of-the-art performance, particularly in micro-expression spotting. Ablation studies further verify the effectiveness of our proposed modules.
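To make the supervised contrastive learning step more concrete, the following is a minimal NumPy sketch of the standard supervised contrastive loss, in which embeddings of frames sharing a label are pulled together and all others are pushed apart. It is an illustration under stated assumptions, not the authors' implementation: the function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings are hypothetical, and the abstract does not say which variant of the loss the paper uses or how frame labels are defined.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over a batch of embeddings.

    Assumption: this is the common formulation that averages the
    log-probability of each positive pair; the paper's exact variant
    is not given in the abstract.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    self_mask = np.eye(n, dtype=bool)
    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive partner in this batch

    # L2-normalise so the dot product is a cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.maximum(norms, 1e-12)
    sim = (z @ z.T) / temperature

    # Denominator: log-sum-exp over every other sample (anchor excluded).
    sim_others = np.where(self_mask, -np.inf, sim)
    row_max = sim_others.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(
        np.exp(sim_others - row_max).sum(axis=1, keepdims=True)
    )
    log_prob = sim - log_denom

    # Mean log-probability of the positives, for anchors that have any.
    mean_log_prob_pos = (log_prob * pos_mask).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())


# Toy embeddings: two tight clusters in 2-D (hypothetical data).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(embeddings, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(embeddings, [0, 1, 0, 1])
```

When the labels agree with the cluster structure, the loss is close to zero. When they cut across the clusters, it is much larger, and that penalty is what pushes a model to separate hard-to-classify frames.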

Paper Structure (18 sections, 7 equations, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Illustration of macro- and micro-expression spotting.
  • Figure 2: Extracted ROIs and constructed facial graph structure are denoted in yellow, while the nose tip region for face alignment is denoted in green.
  • Figure 3: Overview of the proposed framework. (a) The data pre-processing module partitions the input video into overlapping temporal windows using the receptive field adaptive sliding window strategy and extracts facial graph-structured optical flows; (b) the feature learning module employs the SpoT-GCN which takes optical flow features as input for frame-level apex or boundary probability estimation; (c) the post-processing module aggregates the probability maps from all frames and generates expression proposals.
  • Figure 4: Network structure of our SpoT-GCN and the scale change between different facial graph structures through FLGP.
  • Figure 5: Visualization analysis of supervised contrastive learning. (a) and (b) show the PCA distribution of certain macro- and micro-expression frames: (a) without supervised contrastive learning and (b) with supervised contrastive learning. (c) and (d) depict the PCA distribution of certain micro-expression frames and normal frames: (c) without supervised contrastive learning and (d) with supervised contrastive learning.
  • ...and 1 more figure
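
The partitioning of a video into overlapping temporal windows, described in the Figure 3 caption, can be sketched as below. This is a simplified illustration only: `overlapping_windows`, `window_length`, and `stride` are hypothetical names, and both values are plain parameters here. In the paper the window length adapts to the temporal receptive field of the network, but the rule for choosing it is not given in this summary.

```python
def overlapping_windows(num_frames, window_length, stride):
    """Return (start, end) frame index pairs, end exclusive, covering a video.

    Windows overlap whenever stride < window_length. The last window is
    clipped so that it ends on the final frame.
    """
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    if num_frames <= 0:
        return []

    windows = []
    start = 0
    while True:
        end = min(start + window_length, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the whole video is covered
        start += stride
    return windows


# A 10-frame clip, windows of 4 frames, moving 2 frames at a time.
windows = overlapping_windows(num_frames=10, window_length=4, stride=2)
```

Each frame then appears in more than one window. That is why a later aggregation step over per-frame probabilities, as in part (c) of the same caption, is needed.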