AHMSA-Net: Adaptive Hierarchical Multi-Scale Attention Network for Micro-Expression Recognition

Lijun Zhang, Yifan Zhang, Weicheng Tang, Xinzhi Sun, Xiaomeng Wang, Zhanshan Li

TL;DR

This work addresses micro-expression recognition (MER) with AHMSA-Net, which uses the onset and apex frames of a sequence to construct three-dimensional optical flow maps consisting of horizontal flow $u$, vertical flow $v$, and optical-flow strain $os$. The architecture combines an adaptive hierarchical downsampling framework, which captures micro-movements at fine and coarse granularities, with a multi-scale attention mechanism that fuses channel and spatial features; the network is trained with cross-entropy loss. Empirically, AHMSA-Net achieves state-of-the-art or near-state-of-the-art results on the composite MER databases and on CAS(ME)$^3$, indicating cross-database robustness and improved discrimination of subtle facial motions. By pairing precise motion cues with hierarchical multi-scale attention, the approach also offers a foundation for related dynamic recognition tasks.

Abstract

Micro-expression recognition (MER) is challenging because the motion changes involved are transient and subtle. In recent years, deep learning methods based on attention mechanisms have made some breakthroughs in MER. However, these methods still capture features insufficiently and adapt poorly to the instantaneous, subtle movement changes of micro-expressions. Therefore, in this paper, we design an Adaptive Hierarchical Multi-Scale Attention Network (AHMSA-Net) for MER. Specifically, we first use the onset and apex frames of the micro-expression sequence to extract three-dimensional (3D) optical flow maps, comprising horizontal optical flow, vertical optical flow, and optical flow strain. The optical flow feature maps are then fed into AHMSA-Net, which consists of two parts: an adaptive hierarchical framework and a multi-scale attention mechanism. With the adaptive downsampling hierarchical attention framework, AHMSA-Net captures the subtle changes of micro-expressions at different granularities (fine and coarse) by dynamically adjusting the size of the optical flow feature map at each layer. With the multi-scale attention mechanism, AHMSA-Net learns micro-expression action information by fusing features from different scales (channel and spatial). Together, these two modules improve the accuracy of MER. Experiments demonstrate that the proposed method achieves competitive results on major micro-expression databases, with recognition accuracy of up to 78.21% on the composite databases (SMIC, SAMM, CASME II) and 77.08% on the CAS(ME)$^3$ database.
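The three-channel input described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that the horizontal and vertical flow fields `u` and `v` between the onset and apex frames have already been estimated by some dense optical-flow method (the abstract does not say which one), and it uses the strain-magnitude formula that is common in the MER literature, which is an assumption here rather than something the abstract spells out. The function names `optical_strain` and `flow_to_3d_map` are hypothetical.

```python
import numpy as np

def optical_strain(u, v):
    """Optical-strain magnitude from horizontal (u) and vertical (v) flow.

    Assumed formula (common in MER work, not stated in the abstract):
    os = sqrt(e_xx^2 + e_yy^2 + 2 * e_xy^2), with
    e_xx = du/dx, e_yy = dv/dy, e_xy = 0.5 * (du/dy + dv/dx).
    """
    # np.gradient returns derivatives along axis 0 (rows, y) first,
    # then along axis 1 (columns, x).
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                   # normal strain along x
    e_yy = dv_dy                   # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)   # shear strain (symmetric, counted twice)
    return np.sqrt(e_xx**2 + e_yy**2 + 2.0 * e_xy**2)

def flow_to_3d_map(u, v):
    """Stack u, v and strain into an (H, W, 3) map, one channel each."""
    return np.stack([u, v, optical_strain(u, v)], axis=-1)

if __name__ == "__main__":
    # Synthetic flow fields standing in for a real onset-apex flow estimate.
    rng = np.random.default_rng(0)
    u = rng.standard_normal((28, 28))
    v = rng.standard_normal((28, 28))
    print(flow_to_3d_map(u, v).shape)
```

The strain channel responds to spatial changes in the flow rather than to the flow itself, which is why it is often paired with $u$ and $v$ for subtle facial deformations.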

Paper Structure (27 sections, 7 equations, 5 figures, 5 tables)


Figures (5)

  • Figure 1: The overall framework for micro-expression recognition. The framework highlighted in the red box is divided into two parts: the three-dimensional (3D) optical flow extraction part and the AHMSA-Net part. The multi-scale attention mechanism highlighted in the blue box consists of three stacked modules: the channel attention module, the spatial attention module, and the feed-forward module. LayerNorm is applied to the input of each module.
  • Figure 2: Internal details of the channel attention module and spatial attention module.
  • Figure 3: Confusion Matrix for All Micro-expression Databases.
  • Figure 4: Impact of Batch Size.
  • Figure 5: Impact of Number of Multi-Scale Attention Blocks.
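The stacking order given in the Figure 1 caption (channel attention, then spatial attention, then feed-forward, each preceded by LayerNorm) can be sketched roughly as follows. This is only an illustration of the ordering: the parameter-free attention, the residual connections, the ReLU feed-forward layer, and all names (`multi_scale_attention_block`, `w1`, `w2`) are assumptions made for the sketch, and the actual module internals shown in Figure 2 are not reproduced here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) over its channels, without learned scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    """Tokens attend to each other: x is (N, C), attention weights are (N, N)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    return softmax(scores) @ x

def channel_attention(x):
    """Channels attend to each other: same operation on the transposed map."""
    return spatial_attention(x.T).T

def feed_forward(x, w1, w2):
    """Two-layer position-wise MLP with ReLU (hypothetical weights w1, w2)."""
    return np.maximum(x @ w1, 0.0) @ w2

def multi_scale_attention_block(x, w1, w2):
    """Pre-LayerNorm block: channel attention -> spatial attention -> feed-forward.

    Residual additions are an assumption of this sketch.
    """
    x = x + channel_attention(layer_norm(x))
    x = x + spatial_attention(layer_norm(x))
    x = x + feed_forward(layer_norm(x), w1, w2)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((49, 16))        # e.g. a 7x7 feature map, 16 channels
    w1 = rng.standard_normal((16, 32)) * 0.1
    w2 = rng.standard_normal((32, 16)) * 0.1
    print(multi_scale_attention_block(tokens, w1, w2).shape)
```

Because every sub-module maps an `(N, C)` array to an `(N, C)` array, blocks of this form can be stacked, which is the quantity varied in Figure 5.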