Adaptive Fusion Network with Temporal-Ranked and Motion-Intensity Dynamic Images for Micro-expression Recognition
Thi Bich Phuong Man, Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo
TL;DR
This work tackles micro-expression recognition (MER) by addressing the limits of conventional dynamic images, which often fail to capture subtle, short-lived motions. It introduces two specialized representations: the Temporal-ranked dynamic image (TRDI), which emphasizes frames around the apex, and the Motion-Intensity dynamic image (MIDI), which encodes frame-wise motion intensity. An Adaptive Fusion Network (AFN) learns how to integrate the two. The AFN comprises a Representation Fusion Block (RFB) for adaptive spatial fusion and a Multi-scale Channel Attention Block (MSCAB) for robust feature extraction, yielding state-of-the-art performance on CASME-II ($\text{Acc}=93.95\%$, $\text{UF1}=0.897$) and strong results on SAMM and MMEW. Ablation studies show that removing either the RFB or the MSCAB degrades performance, indicating that adaptive fusion and attention are both important for capturing the fine-grained, transient cues characteristic of MEs.
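For context, the conventional dynamic image that TRDI and MIDI build on can be sketched as a weighted sum of frames. The sketch below is not the authors' TRDI or MIDI construction: it assumes the simplified linear approximate-rank-pooling weights $\alpha_t = 2t - T - 1$ commonly used for dynamic images, and the function name `dynamic_image` is hypothetical.

```python
import numpy as np


def dynamic_image(frames):
    """Collapse a clip of shape (T, H, W) or (T, H, W, C) into one image.

    Assumption: simplified approximate-rank-pooling weights
    alpha_t = 2t - T - 1 for t = 1..T. This is the conventional
    baseline, not the TRDI or MIDI weighting proposed in the paper.
    """
    frames = np.asarray(frames, dtype=np.float64)
    num_frames = frames.shape[0]
    t = np.arange(1, num_frames + 1)
    # Early frames get negative weights, late frames positive ones,
    # and the weights sum to zero, so static content cancels out.
    alpha = 2.0 * t - num_frames - 1.0
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=(0, 0))


# A clip whose brightness grows linearly over three 2x2 frames.
clip = np.stack([np.full((2, 2), v) for v in (0.0, 1.0, 2.0)])
image = dynamic_image(clip)
```

Because the weights depend only on frame position, a brief, low-intensity motion near the apex contributes no more than any other frame, which is the limitation the two proposed representations are meant to address.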
Abstract
Micro-expressions (MEs) are subtle, transient facial changes of very low intensity. They are almost imperceptible to the naked eye, yet they reveal a person's genuine emotions, which makes them valuable in lie detection, behavioral analysis, and psychological assessment. This paper proposes a novel micro-expression recognition (MER) method with two main contributions. First, we propose two complementary representations: the Temporal-ranked dynamic image, which emphasizes temporal progression, and the Motion-intensity dynamic image, which highlights subtle motions through a frame reordering mechanism that incorporates motion intensity. Second, we propose an Adaptive Fusion Network (AFN), which automatically learns to integrate these two representations, thereby enhancing discriminative ME features while suppressing noise. Experiments on three benchmark datasets (CASME-II, SAMM, and MMEW) demonstrate the superiority of the proposed method. Specifically, AFN achieves 93.95% accuracy and 0.897 UF1 on CASME-II, setting a new state of the art. On SAMM, the method attains 82.47% accuracy and 0.665 UF1, demonstrating more balanced recognition across classes. On MMEW, the model achieves 76.00% accuracy, further confirming its generalization ability. These results show that both the input representations and the proposed architecture play important roles in improving MER performance. They also provide a solid foundation for further research and practical applications in affective computing, lie detection, and human-computer interaction.
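The two reported metrics answer different questions, which is why accuracy and UF1 can diverge on class-imbalanced datasets. The following is a minimal sketch of both, assuming UF1 denotes the unweighted (macro-averaged) F1 score over the classes present in the ground truth. The function names are hypothetical, and the evaluation protocol used in the paper (for example, how scores are aggregated across subjects) is not specified here.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


def unweighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, each class weighted equally."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for label in np.unique(y_true):
        tp = np.sum((y_true == label) & (y_pred == label))
        fp = np.sum((y_true != label) & (y_pred == label))
        fn = np.sum((y_true == label) & (y_pred != label))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Toy example: a classifier that always predicts the majority class.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
acc = accuracy(y_true, y_pred)
uf1 = unweighted_f1(y_true, y_pred)
```

In this toy example the accuracy is 0.75 while the UF1 is only about 0.43, because the minority class is never recognized. A high UF1 therefore indicates that performance is not driven by the majority class alone.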
