Temporal and Spatial Feature Fusion Framework for Dynamic Micro Expression Recognition

Feng Liu, Bingyu Nan, Xuezhong Qian, Xiaolan Fu

TL;DR

Dynamic micro-expression recognition (DMER) is challenged by brief, localized cues that are difficult to model with a single modality. The authors introduce TSFmicro, a dual-stream framework that fuses temporal differences captured by a RetNet-based temporal branch with spatial cues processed by a shallow Transformer spatial branch, using a parallel high-dimensional time-space fusion to learn informative semantic representations. Across CASME II, SAMM, and CAS(ME)$^3$, TSFmicro achieves state-of-the-art results, with notable gains on small-class categories, and visualization shows better region localization and discriminability. The work underscores the value of multimodal spatio-temporal fusion for robust DMER and opens directions for cross-cultural generalization and interpretability, while noting current limitations and proposing transfer and contrastive learning for future improvements.

Abstract

When emotions are repressed, an individual's true feelings may still be revealed through micro-expressions, which are therefore regarded as a genuine source of insight into authentic emotions. However, their transient and highly localised nature makes accurate recognition difficult: recognition accuracy can be as low as 50%, even for professionals. To address these challenges, we explore dynamic micro-expression recognition (DMER) using multimodal fusion techniques, with special attention to the fusion of temporal and spatial modal features. We propose a novel Temporal and Spatial feature Fusion framework for DMER (TSFmicro). The framework integrates a Retention Network (RetNet) and a transformer-based DMER network to recognise micro-expressions efficiently by capturing and fusing temporal and spatial relations. We also propose a novel parallel time-space fusion method, designed from the perspective of modal fusion, that fuses spatio-temporal information in a high-dimensional feature space, yielding complementary "where-how" relationships at the semantic level and providing richer semantic information for the model. Experiments on three well-recognised micro-expression datasets show that TSFmicro outperforms other contemporary state-of-the-art methods.

Paper Structure

This paper contains 20 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: This paper considers the potential impact of temporal fusion on micro-expression recognition performance from the perspective of modal fusion. (a) Temporal information between frames is extracted as a temporal feature using difference frames. (b) Position embedding is used to learn the positional information associated with the occurrence of an action in order to map it to temporal features. (c) Spatio-temporal information is extracted through temporal and spatial streams, and different spatio-temporal fusion approaches are tested. (d) Performance of TSFmicro with different spatio-temporal modal fusion approaches.
  • Figure 2: An overview of the proposed TSFmicro architecture. (a) The TSFmicro process: first, the face is cropped; second, the difference frames between the Apex and Onset frames are used as the temporal information and the Onset frames are used as the spatial information; third, the spatio-temporal sub-branch captures and fuses the spatio-temporal information; finally, the data is classified. (b) Fusion module. (c) The structure of the T to S (early) fusion approach. (d) The structure of the T to S fusion approach. (e) The structure of the S-to-T fusion approach. (f) The structure of the T-S (late) fusion approach.
  • Figure 3: Evaluation scores of SAMM, CASME II and CAS(ME)$^3$ datasets under 3/4-class and 5/7-class classification conditions during TSFmicro training.
  • Figure 4: Confusion matrix evaluation results of our proposed TSFmicro framework with different datasets.
  • Figure 5: Evaluation scores of different fusion methods on the SAMM and CASME II datasets under 5 classification conditions during TSFmicro training.
  • ...and 2 more figures
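
The input preparation described in the Figure 2 caption (difference frame between Apex and Onset as the temporal input, Onset frame as the spatial input) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `difference_frame` and `prepare_inputs` are hypothetical, frames are assumed to be already cropped grayscale images stored as nested lists, and the signed pixel-wise subtraction is an assumption about how the difference is taken.

```python
from typing import List, Tuple

# Assumption: a frame is a cropped grayscale face image, stored as rows of pixel values.
Frame = List[List[float]]


def difference_frame(onset: Frame, apex: Frame) -> Frame:
    """Return the pixel-wise difference apex - onset (hypothetical helper).

    The signed subtraction is an assumption; the source only states that
    difference frames between Apex and Onset carry the temporal information.
    """
    same_rows = len(onset) == len(apex)
    if not same_rows or any(len(o) != len(a) for o, a in zip(onset, apex)):
        raise ValueError("onset and apex frames must have the same shape")
    return [
        [a - o for o, a in zip(o_row, a_row)]
        for o_row, a_row in zip(onset, apex)
    ]


def prepare_inputs(onset: Frame, apex: Frame) -> Tuple[Frame, Frame]:
    """Build the (temporal, spatial) input pair for the two streams.

    temporal: difference frame between Apex and Onset
    spatial:  the Onset frame itself
    """
    temporal = difference_frame(onset, apex)
    spatial = onset
    return temporal, spatial


if __name__ == "__main__":
    # Toy 2x2 frames standing in for cropped face images.
    onset = [[10.0, 20.0], [30.0, 40.0]]
    apex = [[12.0, 20.0], [27.0, 45.0]]
    temporal, spatial = prepare_inputs(onset, apex)
    print("temporal input:", temporal)
    print("spatial input:", spatial)
```

In this sketch the temporal stream only sees what changed between the two frames, while the spatial stream sees the unchanged Onset face. How the two streams are encoded and fused is left out here, since those details are not given in this section.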