
SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

TL;DR

This work tackles facial expression spotting, with an emphasis on micro-expressions, by introducing a sliding window-based multi-temporal-resolution optical flow (SW-MRO) feature that magnifies subtle motions while mitigating the effect of head movement. It then presents SpotFormer, a multi-scale spatio-temporal Transformer that uses facial local graph pooling (FLGP) and learnable temporal downsampling to capture multi-scale dynamics and produce frame-level probabilities for the onset, apex, offset, expression, and neutral states of both macro-expressions (MaEs) and micro-expressions (MEs). A supervised contrastive loss sharpens the decision boundaries between MaE, ME, and neutral frames, improving discriminability. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show state-of-the-art performance, particularly in ME spotting, and ablations validate the contributions of SW-MRO, FLGP, and contrastive learning.
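To make the role of the supervised contrastive term more concrete, here is a minimal sketch of the commonly used supervised contrastive formulation, in which embeddings that share a label are pulled together and all other embeddings are pushed apart. This summary does not state the exact loss used in SpotFormer, so the function name `supcon_loss`, the temperature value, and the label encoding (0 = neutral, 1 = MaE, 2 = ME) are assumptions made only for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumed form, not the paper's code).

    embeddings: list of feature vectors, one per frame.
    labels: hypothetical class ids, e.g. 0 = neutral, 1 = MaE, 2 = ME.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            a: sum(x * y for x, y in zip(z[i], z[a])) / temperature
            for a in range(n)
            if a != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Two tight clusters: labels that match the clusters give a much smaller loss
# than labels that cut across them.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(supcon_loss(frames, [1, 1, 2, 2]))
print(supcon_loss(frames, [1, 2, 1, 2]))
```

In this form the loss is small when frames of the same class already sit close together in embedding space, which is the sense in which such a term can sharpen the boundaries between MaE, ME, and neutral frames.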

Abstract

Facial expression spotting, identifying periods where facial expressions occur in a video, is a significant yet challenging task in facial expression analysis. The issues of irrelevant facial movements and the challenge of detecting subtle motions in micro-expressions remain unresolved, hindering accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based multi-temporal-resolution Optical flow (SW-MRO) feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while avoiding the optical flow being dominated by head movements. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, we use the proposed Facial Local Graph Pooling (FLGP) operation and convolutional layers to extract multi-scale spatio-temporal features. We show the validity of the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
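As a rough illustration of what "multi-temporal-resolution optical flow within compact sliding windows" could look like, the sketch below only enumerates frame-index pairs. It assumes one plausible reading of SW-MRO: inside each window, the first frame serves as the reference and flow is computed to every later frame in that window, so the temporal gaps range from 1 to `window - 1`. The function name `sw_mro_pairs`, this pairing scheme, and the example window length are assumptions; the abstract does not specify them, and the optical flow estimator itself is left out.

```python
def sw_mro_pairs(num_frames, window):
    """List the (reference, target) frame pairs for each sliding window.

    Assumed scheme, not confirmed by the source: the window starting at
    frame t pairs t with t+1, t+2, ..., t+window-1.
    """
    if window < 2:
        raise ValueError("window must cover at least two frames")
    all_pairs = []
    for start in range(num_frames - window + 1):
        # One optical-flow field per temporal gap inside the window.
        all_pairs.append([(start, start + gap) for gap in range(1, window)])
    return all_pairs


# A 5-frame clip with a hypothetical window of 3 frames.
for pairs in sw_mro_pairs(num_frames=5, window=3):
    print(pairs)
```

Because every pair stays inside a short window, slow head motion accumulates only over a few frames, while the larger gaps within the window still give a subtle micro-expression enough time to produce a measurable displacement.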

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Illustration of macro- and micro-expression spotting.
  • Figure 2: Overview of the proposed framework. (a) The data pre-processing module calculates the SW-MRO features; (b) the probability estimation module employs SpotFormer, which takes optical flow features as input for frame-level apex or boundary probability estimation; (c) the post-processing module aggregates the probability maps from all frames and generates expression proposals.
  • Figure 3: The extracted ROIs and the constructed facial graph structure are shown in yellow, while the nose tip region used for face alignment is shown in green.
  • Figure 4: Overview of the proposed SpotFormer. (a) The network structure of SpotFormer. The dotted gray box provides an intuitive illustration of the spatio-temporal attention block. A spatial graph attention layer models the spatial relationships within the graph of each frame, while a temporal node attention layer captures temporal variations of each graph node across frames. Subsequently, spatial and temporal downsampling are performed using FLGP and a learnable temporal downsampling layer, respectively. (b) The scale change between different facial graph structures through FLGP. Each facial muscle is represented by a distinct color (e.g., orange represents the lips). In the first spatial downsampling, ROIs belonging to the same facial muscle are aggregated into single nodes via FLGP to construct graph #2. In the second spatial downsampling, all aggregated nodes are further merged into a single node via FLGP to form graph #3.
  • Figure 5: Visualization of the optical flow of selected micro-expression frames computed by three strategies. The data are the vertical component of the optical flow at the left mouth corner while subject 11 from the SAMM-LV dataset performs a micro-expression. (a) Optical flow computed between adjacent frames; (b) optical flow computed using the proposed SW-MRO; (c) optical flow computed with a large sliding window strategy.
  • ...and 3 more figures
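
The two-stage spatial downsampling described in the Figure 4 caption (ROIs merged per facial muscle, then all merged nodes collapsed into one) can be sketched as a grouped pooling over node features. The caption does not say how FLGP aggregates nodes, so mean pooling, the function name `graph_pool`, the group names, and the toy feature values below are all assumptions for illustration.

```python
def graph_pool(node_features, groups):
    """Pool node feature vectors that share a group id.

    Mean pooling is an assumption; the source does not state the operator.
    Returns a dict mapping each group id to its pooled feature vector.
    """
    buckets = {}
    for feat, group in zip(node_features, groups):
        buckets.setdefault(group, []).append(feat)
    return {
        group: [sum(col) / len(col) for col in zip(*feats)]
        for group, feats in buckets.items()
    }


# Graph #1: four hypothetical ROI nodes with 2-D features.
roi_features = [[1.0, 2.0], [3.0, 4.0], [0.0, 2.0], [2.0, 0.0]]
roi_to_muscle = ["brow", "brow", "lips", "lips"]  # hypothetical assignment

# Graph #2: one node per facial muscle.
graph2 = graph_pool(roi_features, roi_to_muscle)
print(graph2)  # {'brow': [2.0, 3.0], 'lips': [1.0, 1.0]}

# Graph #3: all muscle nodes merged into a single node.
graph3 = graph_pool(list(graph2.values()), ["face"] * len(graph2))
print(graph3)  # {'face': [1.5, 2.0]}
```

Grouping by facial region rather than by generic graph clustering keeps each coarser node tied to a specific part of the face, which appears to be the motivation behind the "facial local" naming.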