Emotion Recognition from the perspective of Activity Recognition

Savinay Nagendra; Prapti Panigrahi

TL;DR

This work addresses continuous emotion recognition in unconstrained real-world settings by modeling affect with continuous valence $V$ and arousal $A$ rather than discrete labels. It reframes emotion recognition as an action-recognition problem and introduces a three-stream end-to-end regression pipeline with spatial self-attention-based key-frame sampling, eye-mouth optical-flow streams, and temporal Gaussian attention filters, evaluated on the in-the-wild AFEW-VA dataset where $V$ and $A$ are annotated per frame in $[-10,10]$. The approach achieves competitive, often state-of-the-art, CCC and MSE metrics on this dataset, demonstrating robustness to illumination, pose, and occlusion in naturalistic conditions. This methodology offers a practical path toward deploying continuous affect models on mobile devices and interactive systems by leveraging robust spatio-temporal representations and attention-driven frame selection.
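The summary above reports CCC and MSE without defining them. As a minimal sketch, assuming CCC refers to Lin's concordance correlation coefficient in its standard form, $\mathrm{CCC} = \frac{2\,\mathrm{cov}(x,y)}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$ with population (biased) variances, the two metrics could be computed for per-frame valence or arousal predictions as follows. The function names `mse` and `ccc` and the sample values are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol (for example, per-video versus pooled over all frames) is not specified here.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def ccc(pred, target):
    """Lin's concordance correlation coefficient (population variances).

    Returns NaN when both sequences are constant with equal means,
    because the denominator is zero and the coefficient is undefined.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        return math.nan
    return 2 * cov / denom


# Hypothetical per-frame valence annotations in [-10, 10] and a prediction
# that tracks them perfectly except for a constant offset of +2.
target = [-10.0, 0.0, 10.0]
pred = [-8.0, 2.0, 12.0]
print(mse(pred, target), ccc(pred, target))
```

Unlike Pearson correlation, CCC penalizes a constant offset or scale mismatch between predictions and annotations, which is one reason it is commonly reported alongside MSE for continuous affect estimation.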

Abstract

Efficient emotion recognition systems have applications in several domains, such as medicine, driver fatigue surveillance, social robotics, and human-computer interaction. Human emotional states, behaviors, and reactions displayed in real-world settings can be appraised using latent continuous dimensions. Continuous dimensional models of human affect, such as those based on valence and arousal, describe a broad range of spontaneous everyday emotions more accurately than traditional models built on discrete, stereotypical emotion categories (e.g., happiness, surprise). Most prior work on estimating valence and arousal considers laboratory settings and acted data. However, for emotion recognition systems to be deployed and integrated into real-world mobile and computing devices, we need to consider data collected in real-world conditions. Action recognition is a domain of computer vision that involves capturing complementary information on appearance from still frames and motion between frames. In this paper, we treat emotion recognition from the perspective of action recognition by applying deep learning architectures designed specifically for action recognition to continuous affect recognition. We propose a novel three-stream, end-to-end deep learning regression pipeline with an attention mechanism, an ensemble design based on sub-modules of multiple state-of-the-art action recognition systems. The pipeline includes a novel data pre-processing approach that uses a spatial self-attention mechanism to extract keyframes. The optical flow of high-attention regions of the face is extracted to capture temporal context. Comparative experiments are conducted on the in-the-wild AFEW-VA dataset. Quantitative analysis shows that the proposed model outperforms multiple standard baselines from both emotion recognition and action recognition.

Paper Structure (20 sections, 6 equations, 11 figures)


Introduction
Related Work
AFEW-VA: emotion recognition dataset in-the-wild
Data
Annotations
Our Approach
Key Frame Sub-sampling
Local Feature Extraction
Spatial Attention
Temporal Pooling
Optical Flow
Temporal Gaussian Attention Filters
Network Architecture and Training
Results
Baselines
...and 5 more sections

Figures (11)

Figure 1: The 2D Emotion wheel or the 2D Valence-Arousal Space
Figure 2: Pipeline of our Emotion Recognition System
Figure 3: Data Statistics of our dataset
Figure 4: Screenshot of the annotation tool developed and used to annotate the AFEW-VA dataset.
Figure 5: Example of annotated valence and arousal levels for a sample video from our dataset along with some representative frames.
...and 6 more figures
