MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

Linhuang Wang, Xin Kang, Fei Ding, Satoshi Nakagawa, Fuji Ren

TL;DR

The paper tackles dynamic facial expression recognition (DFER), which depends on localized facial-muscle changes rather than distinct moving targets. It introduces MSSTNet, a framework that uses a Multi-scale Embedding Layer (MELayer) to encode multi-scale spatial features from a CNN and a Temporal Transformer (T-Former) to model temporal dynamics while integrating those features. Ablation studies and visualizations support the claim that integrating multi-scale spatial information over time improves discriminative capacity. The approach achieves state-of-the-art results on two in-the-wild datasets, DFEW and FERV39k.

Abstract

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.

Paper Structure

This paper contains 6 sections, 6 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: (a) and (b) represent image sequences for video action recognition and DFER, respectively. In (a), distinct moving targets are present, while in (b), only localized changes in facial muscle states are observed.
  • Figure 2: The overall architecture of MSSTNet. The visual feature extraction module is a CNN backbone, and the T-Former is a transformer composed of $L$ blocks. PE denotes the temporal positional embedding, $\bigoplus$ denotes element-wise addition, MHSA stands for multi-head self-attention, and T-Mean indicates averaging over the temporal dimension.
  • Figure 3: Visualization of feature maps. The feature maps within the red boxes are the outputs after passing through the T-Former, while those without boxes are feature maps that have not passed through the T-Former.
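
The pipeline described in the abstract and in the Figure 2 caption (CNN features at several scales, an embedding step, temporal positional embedding, self-attention blocks joined by element-wise addition, and T-Mean before classification) might be sketched roughly as follows. This is a minimal NumPy illustration and not the authors' implementation. Several details are assumptions that the text above does not specify: `embed_scale` is a hypothetical stand-in for the MELayer (global average pooling plus a linear projection), one scale is injected per block, attention is single-head rather than MHSA, and all shapes, weights, and the class count of 7 are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token over its feature dimension (no learned scale/shift).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def embed_scale(feat, w):
    # Hypothetical MELayer stand-in (assumption): pool a (T, C, H, W)
    # feature map over space, then project each frame to D dimensions.
    pooled = feat.mean(axis=(2, 3))      # (T, C)
    return pooled @ w                    # (T, D)

def self_attention(x, wq, wk, wv):
    # Single-head self-attention over the T frame tokens
    # (a simplification of the MHSA named in Figure 2).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, T)
    return softmax(scores, axis=-1) @ v       # (T, D)

def msst_forward(feats, params):
    # One embedding per spatial scale.
    embeds = [embed_scale(f, w) for f, w in zip(feats, params["embed"])]
    # Add the temporal positional embedding (PE) to the first scale.
    x = embeds[0] + params["pos"]
    for i, (wq, wk, wv) in enumerate(params["blocks"]):
        # Residual self-attention block over time.
        x = x + self_attention(layer_norm(x), wq, wk, wv)
        # Assumption: integrate the next scale by element-wise addition.
        if i + 1 < len(embeds):
            x = x + embeds[i + 1]
    clip = x.mean(axis=0)                # T-Mean: average over frames
    return clip @ params["head"]         # class logits

# Illustrative sizes only: 16 frames, 3 scales, 3 blocks, 7 classes.
T, D, NUM_CLASSES = 16, 32, 7
shapes = [(64, 28, 28), (128, 14, 14), (256, 7, 7)]
feats = [rng.standard_normal((T, c, h, w)) for c, h, w in shapes]
params = {
    "embed": [0.1 * rng.standard_normal((c, D)) for c, _, _ in shapes],
    "pos": 0.1 * rng.standard_normal((T, D)),
    "blocks": [tuple(0.1 * rng.standard_normal((D, D)) for _ in range(3))
               for _ in range(3)],
    "head": 0.1 * rng.standard_normal((D, NUM_CLASSES)),
}
logits = msst_forward(feats, params)
print(logits.shape)  # (7,)
```

In this sketch the random arrays in `feats` stand in for the outputs of different CNN stages. Adding each scale's embedding between blocks is one simple way to read "continually integrating multi-scale spatial information"; the paper's actual integration scheme may differ.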