Emotion Recognition from the perspective of Activity Recognition
Savinay Nagendra, Prapti Panigrahi
TL;DR
This work addresses continuous emotion recognition in unconstrained real-world settings by modeling affect with continuous valence $V$ and arousal $A$ rather than discrete labels. It reframes emotion recognition as an action-recognition problem and introduces a three-stream end-to-end regression pipeline with spatial self-attention-based key-frame sampling, eye-mouth optical-flow streams, and temporal Gaussian attention filters, evaluated on the in-the-wild AFEW-VA dataset where $V$ and $A$ are annotated per frame in $[-10,10]$. The approach achieves competitive, often state-of-the-art, CCC and MSE metrics on this dataset, demonstrating robustness to illumination, pose, and occlusion in naturalistic conditions. This methodology offers a practical path toward deploying continuous affect models on mobile devices and interactive systems by leveraging robust spatio-temporal representations and attention-driven frame selection.
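For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how CCC (Lin's concordance correlation coefficient) and MSE are commonly computed for one affect dimension. It is an illustration, not the authors' evaluation code: the function names `concordance_cc` and `mse` and the use of NumPy are assumptions made here, and the inputs are assumed to be per-frame predictions and annotations in $[-10,10]$.

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient for two 1-D sequences.

    Hypothetical helper for illustration; uses population (ddof=0) statistics.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    # Penalizes both low correlation and any shift or scale mismatch.
    # Undefined (division by zero) if both sequences are constant and equal.
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def mse(y_true, y_pred):
    """Mean squared error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy per-frame valence annotations and predictions (made-up values).
valence_true = np.array([-10.0, 0.0, 10.0])
valence_pred = valence_true + 1.0  # perfectly correlated but biased by +1

ccc_biased = concordance_cc(valence_true, valence_pred)
mse_biased = mse(valence_true, valence_pred)
```

Unlike Pearson correlation, CCC drops below 1 for the biased prediction above even though the two sequences are perfectly correlated, which is why it is often preferred for continuous valence and arousal regression.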
Abstract
Efficient emotion recognition systems have applications in several domains, such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Human emotional states, behaviors, and reactions displayed in real-world settings can be appraised using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal, describe a broad range of spontaneous everyday emotions more accurately than traditional models of discrete, stereotypical emotion categories (e.g., happiness, surprise). Most prior work on estimating valence and arousal considers laboratory settings and acted data. However, for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in the wild. Action recognition is a domain of computer vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by applying deep learning architectures designed specifically for action recognition to continuous affect recognition. We propose a novel three-stream end-to-end deep learning regression pipeline with an attention mechanism, an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline includes a novel data pre-processing approach that uses a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. Comparative experiments are conducted on the in-the-wild AFEW-VA dataset. Quantitative analysis shows that the proposed model outperforms multiple standard baselines from both emotion recognition and action recognition.
