Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition
Feng Liu, Bingyu Nan, Xuezhong Qian, Xiaolan Fu
TL;DR
Dynamic micro-expression recognition (DMER) is challenged by brief, localized cues that are difficult to model with a single modality. The authors introduce TSFmicro, a dual-stream framework that fuses temporal differences captured by a RetNet-based temporal branch with spatial cues processed by a shallow Transformer spatial branch. A parallel time-space fusion in high-dimensional feature space lets the model learn informative semantic representations. Across CASME II, SAMM, and CAS(ME)^3, TSFmicro achieves state-of-the-art results, with notable gains on small-class categories, and visualizations show better region localization and discriminability. The work underscores the value of multimodal spatio-temporal fusion for robust DMER and opens directions for cross-cultural generalization and interpretability. The authors also note current limitations and propose transfer learning and contrastive learning as future improvements.
Abstract
When emotions are repressed, an individual's true feelings may be revealed through micro-expressions. Consequently, micro-expressions are regarded as a genuine source of insight into an individual's authentic emotions. However, the transient and highly localised nature of micro-expressions makes them difficult to recognise accurately: recognition accuracy is as low as 50%, even for professionals. To address these challenges, it is necessary to explore dynamic micro-expression recognition (DMER) using multimodal fusion techniques, with particular attention to diverse ways of fusing temporal and spatial modal features. In this paper, we propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). The framework integrates a Retention Network (RetNet) and a Transformer-based DMER network to recognise micro-expressions efficiently by capturing and fusing temporal and spatial relations. We also propose a novel parallel time-space fusion method from the perspective of modal fusion. It fuses spatio-temporal information in a high-dimensional feature space, yielding complementary "where-how" relationships at the semantic level and providing the model with richer semantic information. Experiments on three well-recognised micro-expression datasets show that TSFmicro outperforms other contemporary state-of-the-art methods.
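To make the dual-stream idea concrete, the following is a minimal sketch of a parallel time-space fusion, not the authors' implementation. Several choices are assumptions made only for illustration: the onset and apex frames as inputs, a single linear projection standing in for each branch (the RetNet temporal branch and the Transformer spatial branch), concatenation as the fusion operation, and all names (`frame_difference`, `project`, `tsf_forward`, the weight matrices). Reading "where" as the spatial stream and "how" as the temporal stream is likewise an interpretation of the abstract.

```python
import math
import random


def frame_difference(onset, apex):
    # Temporal cue: per-pixel change between the onset and apex frames.
    return [a - o for o, a in zip(onset, apex)]


def project(x, weights):
    # Hypothetical stand-in for a branch encoder: one linear map,
    # where `weights` is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def softmax(z):
    # Numerically stable softmax over a list of scores.
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def random_matrix(rows, cols, rng):
    # Small random weights; a trained model would learn these.
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def tsf_forward(onset, apex, w_time, w_space, w_cls):
    # The two streams are computed independently (in parallel).
    temporal = project(frame_difference(onset, apex), w_time)  # "how"
    spatial = project(apex, w_space)                           # "where"
    # Assumed fusion: concatenate both embeddings in feature space.
    fused = temporal + spatial
    return softmax(project(fused, w_cls))


rng = random.Random(0)
n_pixels, dim, n_classes = 16, 8, 3
onset = [rng.random() for _ in range(n_pixels)]
apex = [rng.random() for _ in range(n_pixels)]
w_time = random_matrix(dim, n_pixels, rng)
w_space = random_matrix(dim, n_pixels, rng)
w_cls = random_matrix(n_classes, 2 * dim, rng)

# One probability per (hypothetical) emotion class.
probs = tsf_forward(onset, apex, w_time, w_space, w_cls)
```

The point of the sketch is structural: each stream is embedded separately, and the classifier only sees the fused representation, so it can draw on both motion and location cues. A real system would replace the linear maps with the learned branches and could use a different fusion operator.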
