Weakly Supervised Continuous Micro-Expression Intensity Estimation Using Temporal Deep Neural Network
Riyadh Mohammed Almushrafy
TL;DR
This work tackles the absence of frame-level micro-expression intensity labels by introducing a dataset-agnostic, weakly supervised framework that derives triangular pseudo-intensity trajectories from onset–apex–offset annotations. It combines a ResNet18-based spatial encoder with a bidirectional GRU temporal model to predict dense frame-wise intensities, trained with a composite loss that combines MSE, a smoothness term, and an apex-ranking term. On SAMM and CASME II, the model shows strong temporal agreement with the pseudo-labels (Spearman correlation of 0.9014 on SAMM and up to 0.9116 on CASME II), outperforming a frame-wise baseline and remaining robust to dataset-specific differences. The approach offers a practical, reproducible path toward continuous micro-expression analysis under realistic annotation constraints and a basis for future cross-dataset generalization and integration with broader affective computing tasks.
Abstract
Micro-facial expressions are brief and involuntary facial movements that reflect genuine emotional states. While most prior work focuses on classifying discrete micro-expression categories, far fewer studies address the continuous evolution of intensity over time. Progress in this direction is limited by the lack of frame-level intensity labels, which makes fully supervised regression impractical. We propose a unified framework for continuous micro-expression intensity estimation using only weak temporal labels (onset, apex, offset). A simple triangular prior converts sparse temporal landmarks into dense pseudo-intensity trajectories, and a lightweight temporal regression model that combines a ResNet18 encoder with a bidirectional GRU predicts frame-wise intensity directly from image sequences. The method requires no frame-level annotation effort and is applied consistently across datasets through a single preprocessing and temporal alignment pipeline. Experiments on SAMM and CASME II show strong temporal agreement with the pseudo-intensity trajectories. On SAMM, the model reaches a Spearman correlation of 0.9014 and a Kendall correlation of 0.7999, outperforming a frame-wise baseline. On CASME II, it achieves up to 0.9116 and 0.8168, respectively, when trained without the apex-ranking term. Ablation studies confirm that temporal modeling and structured pseudo labels are central to capturing the rise-apex-fall dynamics of micro-facial movements. To our knowledge, this is the first unified approach for continuous micro-expression intensity estimation using only sparse temporal annotations.
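To make the triangular prior concrete, the following is a minimal sketch of how onset, apex, and offset frame indices could be turned into a dense pseudo-intensity trajectory. The abstract does not give the exact formula, so the details here are assumptions: intensity rises linearly from 0 at onset to 1 at apex, falls linearly back to 0 at offset, and is 0 outside that interval. The function name `triangular_pseudo_intensity` and its parameters are hypothetical and chosen only for illustration.

```python
def triangular_pseudo_intensity(onset, apex, offset, num_frames):
    """Return one pseudo-intensity value in [0, 1] per frame index.

    Assumed shape (not specified in the abstract): linear rise from
    onset to apex, linear fall from apex to offset, zero elsewhere.
    """
    if not (0 <= onset <= apex <= offset < num_frames):
        raise ValueError("expected 0 <= onset <= apex <= offset < num_frames")

    intensities = []
    for t in range(num_frames):
        if t < onset or t > offset:
            value = 0.0  # outside the annotated micro-expression
        elif t == apex:
            value = 1.0  # peak; also covers apex == onset or apex == offset
        elif t < apex:
            value = (t - onset) / (apex - onset)  # rising edge
        else:
            value = (offset - t) / (offset - apex)  # falling edge
        intensities.append(value)
    return intensities


# Example: a 10-frame clip with onset at frame 2, apex at 4, offset at 8.
trajectory = triangular_pseudo_intensity(onset=2, apex=4, offset=8, num_frames=10)
print(trajectory)
# [0.0, 0.0, 0.0, 0.5, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0]
```

Under these assumptions, the resulting list can serve as the per-frame regression target for a sequence model, and rank-based metrics such as the Spearman and Kendall correlations compare the ordering of predicted intensities against this trajectory rather than their absolute values.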
