SpotFormer: Multi-Scale Spatio-Temporal Transformer for Facial Expression Spotting
Yicheng Deng, Hideaki Hayashi, Hajime Nagahara
TL;DR
This work tackles facial expression spotting, with an emphasis on micro-expressions, by introducing a sliding window-based multi-temporal-resolution optical flow (SW-MRO) feature that magnifies subtle motions while mitigating the influence of head movement. It then presents SpotFormer, a multi-scale spatio-temporal Transformer that uses facial local graph pooling (FLGP) and learnable temporal downsampling to capture multi-scale dynamics and produce frame-level probabilities for onset, apex, offset, expression, and neutral states across macro- and micro-expressions. A supervised contrastive loss is incorporated to sharpen the decision boundaries between macro-expression (MaE), micro-expression (ME), and neutral frames. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show state-of-the-art performance, particularly in ME spotting, with ablations validating the contributions of SW-MRO, FLGP, and contrastive learning.
Abstract
Facial expression spotting, the task of identifying the periods in a video where facial expressions occur, is a significant yet challenging problem in facial expression analysis. Irrelevant facial movements and the difficulty of detecting the subtle motions of micro-expressions remain unresolved issues that hinder accurate expression spotting. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based multi-temporal-resolution Optical flow (SW-MRO) feature, which calculates multi-temporal-resolution optical flow of the input image sequence within compact sliding windows. The window length is tailored to perceive complete micro-expressions and distinguish between general macro- and micro-expressions. SW-MRO can effectively reveal subtle motions while preventing head movements from dominating the optical flow. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that simultaneously encodes spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, we use the proposed Facial Local Graph Pooling (FLGP) operation and convolutional layers to extract multi-scale spatio-temporal features. We validate the architecture of SpotFormer by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
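
As a rough illustration of the third contribution, the sketch below implements the standard supervised contrastive loss in plain Python. It is not the authors' implementation: the function name `supervised_contrastive_loss`, the `temperature` value, and the toy embeddings and labels are assumptions made for illustration, and the exact formulation used in SpotFormer may differ.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Standard supervised contrastive loss (illustrative sketch only).

    embeddings: list of equal-length float vectors, one per frame.
    labels: list of class labels (e.g. "MaE", "ME", "neutral"), one per frame.
    temperature: assumed scaling value, not taken from the paper.
    """
    # L2-normalize each embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other samples that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing

        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }

        # Numerically stable log-sum-exp over all non-anchor samples.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))

        # Average negative log-probability of the positives for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1

    return total / anchors if anchors else 0.0
```

With toy inputs in which same-label embeddings point in similar directions, the loss is small; assigning the same label to dissimilar embeddings increases it, which is the pressure that pulls frames of one class together and pushes the other classes apart.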
