MMA-MRNNet: Harnessing Multiple Models of Affect and Dynamic Masked RNN for Precise Facial Expression Intensity Estimation

Dimitrios Kollias; Andreas Psaroudakis; Anastasios Arsenos; Paraskevi Theofilou; Chunchang Shao; Guanyu Hu; Ioannis Patras

TL;DR

MMA-MRNNet introduces a dynamic, unimodal architecture for video-level Facial Expression Intensity Estimation by first extracting rich frame-level affective representations (valence-arousal, basic expressions, and action units) with a Multi-Task Learning CNN, and then aggregating them through a Masked RNN that adapts to variable video lengths via a Mask routing mechanism. The model employs AU-proxy representations to inject prior knowledge into the training objective and optimizes a loss tied to correlation rather than mean-squared error, improving convergence and robustness. Extensive experiments on Hume-Reaction and multiple in-the-wild datasets show state-of-the-art performance across FEIE tasks, with ablations demonstrating the value of combining all affect channels and the efficacy of dynamic routing. The approach addresses limitations of 3-D CNNs and ad-hoc frame handling, offering a scalable, accurate solution for real-world FEIE from videos.
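To make the idea of a correlation-based objective concrete, here is a minimal sketch of one common form: one minus the Pearson correlation, averaged over the output dimensions. The function name `pearson_loss`, the `eps` stabilizer, and the averaging over outputs are assumptions made for illustration; the summary above only says the loss is "tied to correlation", so the paper's exact formulation may differ.

```python
import numpy as np

def pearson_loss(pred, target, eps=1e-8):
    """Return 1 - mean Pearson correlation across output dimensions.

    pred, target: arrays of shape (batch, n_outputs).
    This is an illustrative loss, not necessarily the one used in the paper.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center each output dimension over the batch.
    pc = pred - pred.mean(axis=0)
    tc = target - target.mean(axis=0)
    # Per-dimension covariance and product of standard deviations (unnormalized).
    cov = (pc * tc).sum(axis=0)
    denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0)) + eps
    # Correlation of +1 gives loss 0; correlation of -1 gives loss 2.
    return float(1.0 - (cov / denom).mean())
```

Unlike mean-squared error, this quantity is invariant to a positive rescaling or shift of the predictions, so it rewards getting the ranking and trend of intensities right rather than their absolute scale.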

Abstract

This paper presents MMA-MRNNet, a novel deep learning architecture for dynamic multi-output Facial Expression Intensity Estimation (FEIE) from video data. Traditional approaches to this task often rely on complex 3-D CNNs, which require extensive pre-training and assume that facial expressions are uniformly distributed across all frames of a video. These methods struggle to handle videos of varying lengths, often resorting to ad-hoc strategies that either discard valuable information or introduce bias. MMA-MRNNet addresses these challenges through a two-stage process. First, the Multiple Models of Affect (MMA) extractor component is a Multi-Task Learning CNN that concurrently estimates valence-arousal, recognizes basic facial expressions, and detects action units in each frame. These representations are then processed by a Masked RNN component, which captures temporal dependencies and dynamically updates weights according to the true length of the input video, ensuring that only the most relevant features are used for the final prediction. The proposed unimodal non-ensemble learning MMA-MRNNet was evaluated on the Hume-Reaction dataset and demonstrated significantly superior performance, surpassing state-of-the-art methods by a wide margin, regardless of whether they were unimodal, multimodal, or ensemble approaches. Finally, we demonstrated the effectiveness of the MMA component of our proposed method across multiple in-the-wild datasets, where it consistently outperformed all state-of-the-art methods across various metrics.
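The following is a minimal sketch of how a mask can make a recurrent aggregator respect the true length of each video when sequences are padded to a common length. It uses a plain tanh RNN in NumPy; the function name `masked_rnn_last_state`, the weight names, and the choice to freeze the hidden state on padded steps are assumptions for illustration, and the paper's actual Masked RNN and routing mechanism may work differently.

```python
import numpy as np

def masked_rnn_last_state(x, lengths, w_in, w_rec, b):
    """Run a simple tanh RNN over padded sequences and return, for each
    video, the hidden state at its last real frame.

    x:       (batch, max_len, feat) padded frame-level features
    lengths: (batch,) true number of frames per video
    w_in:    (feat, hidden) input weights
    w_rec:   (hidden, hidden) recurrent weights
    b:       (hidden,) bias
    """
    batch, max_len, _ = x.shape
    h = np.zeros((batch, w_rec.shape[0]))
    for t in range(max_len):
        h_new = np.tanh(x[:, t] @ w_in + h @ w_rec + b)
        # mask is 1 for real frames and 0 for padding at time step t.
        mask = (t < lengths).astype(x.dtype)[:, None]
        # Padded steps leave the hidden state untouched.
        h = mask * h_new + (1.0 - mask) * h
    return h
```

With this masking, the output for a short video is the same whether it is processed alone or padded inside a batch of longer videos, so no frames need to be discarded or duplicated to reach a fixed length.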

Paper Structure (11 sections, 4 equations, 2 figures, 5 tables)

Introduction
Related Work
Methodology
MMA: Multiple Models of Affect extractor Component
MRNN: Masked RNN and Routing Component
Datasets, Pre-Processing and Implementation Details
Experimental Results
Comparison with the state-of-the-art
Ablation Study
MMA Evaluation Results
Conclusion

Figures (2)

Figure 1: Overview of the proposed MMA-MRNNet for dynamic multi-output Facial Expression Intensity Estimation. MMA-MRNNet comprises two main components: the Multiple Models of Affect (MMA) extractor, which generates affective representations (valence-arousal, basic expressions, and action units) for each video frame, and the Masked RNN and Routing (MRNN), which captures temporal dependencies and dynamically selects key features (and updates weights) according to the variable lengths of input videos.
Figure 2: The Multiple Models of Affect (MMA) extractor component, which outputs the following emotional descriptors for each frame: valence and arousal, 17 action units, and 7 basic expressions.
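As a rough illustration of the per-frame descriptor described in the Figure 2 caption, the sketch below packs the three outputs into a single vector of 2 + 7 + 17 = 26 values. The function name `frame_descriptor`, the ordering of the parts, and the use of softmax for expressions and sigmoid for action units are assumptions; the caption only specifies the counts.

```python
import numpy as np

# Counts taken from the Figure 2 caption.
N_VA, N_EXPR, N_AU = 2, 7, 17

def frame_descriptor(va, expr_logits, au_logits):
    """Concatenate valence-arousal, expression probabilities, and
    action-unit probabilities into one per-frame vector (hypothetical layout)."""
    va = np.asarray(va, dtype=np.float64)
    expr_logits = np.asarray(expr_logits, dtype=np.float64)
    au_logits = np.asarray(au_logits, dtype=np.float64)
    assert va.shape == (N_VA,)
    assert expr_logits.shape == (N_EXPR,)
    assert au_logits.shape == (N_AU,)
    # Softmax: the basic expressions are treated as mutually exclusive classes.
    e = np.exp(expr_logits - expr_logits.max())
    expr_probs = e / e.sum()
    # Sigmoid: each action unit is treated as an independent binary label.
    au_probs = 1.0 / (1.0 + np.exp(-au_logits))
    return np.concatenate([va, expr_probs, au_probs])
```

Stacking these vectors over the frames of a video would give the kind of (frames, features) sequence that a temporal model can consume.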
