Neuromorphic Valence and Arousal Estimation
Lorenzo Berlincioni, Luca Cultrera, Federico Becattini, Alberto Del Bimbo
TL;DR
The paper tackles continuous valence and arousal estimation from facial expressions using neuromorphic (event camera) data. It trains multiple frame- and video-based models on a synthetic neuromorphic analogue of the RGB AFEW-VA dataset created via a V2E simulator, enabling fully labeled neuromorphic data without extra annotation. The approach achieves state-of-the-art results on AFEW-VA and demonstrates zero-shot transfer to real event data (NEFER) for emotion recognition, validating both the data-generation pipeline and model generalization. Key contributions include a comparison of frame- and video-based architectures, an analysis of Temporal Binary Representation encoding with varying bit-depth $N$, and a practical zero-shot deployment scenario for neuromorphic affective computing, with potential impact on privacy-preserving, low-latency emotion analysis. The continuous valence-arousal targets are in the range $[-1,1]$, enabling fine-grained mood tracking from high-temporal-resolution event streams.
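To make the Temporal Binary Representation encoding mentioned above concrete, here is a minimal sketch of the general idea. Events are accumulated into $N$ binary slices, where a pixel is 1 if at least one event fired in that time window, and each pixel's $N$ bits are then read as a single integer and scaled to $[0,1]$. The function names, the `(t, x, y)` event layout, the bit ordering (latest slice as most significant bit), and the normalization by $2^N - 1$ are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def binary_slices(events, height, width, delta_t, n_slices):
    """Build n_slices binary maps: a pixel is 1 if any event fired in that window.

    events: iterable of (t, x, y) tuples; polarity is ignored in this sketch.
    """
    slices = np.zeros((n_slices, height, width), dtype=np.uint8)
    for t, x, y in events:
        k = int(t // delta_t)  # index of the time window this event falls into
        if 0 <= k < n_slices:
            slices[k, int(y), int(x)] = 1
    return slices

def tbr_frame(slices):
    """Collapse N binary slices into one frame by reading each pixel as an N-bit number."""
    n_bits = slices.shape[0]
    # Assumed ordering: slice 0 is the least significant bit, slice N-1 the most.
    weights = 2 ** np.arange(n_bits, dtype=np.int64)
    value = np.tensordot(weights, slices.astype(np.int64), axes=1)  # shape (H, W)
    return value / (2 ** n_bits - 1)  # scale to [0, 1]

# Hypothetical toy input: three events on a 1x2 sensor, N = 3 windows of length 1.
events = [(0.5, 0, 0), (2.5, 0, 0), (1.2, 1, 0)]
frame = tbr_frame(binary_slices(events, height=1, width=2, delta_t=1.0, n_slices=3))
```

With this layout, a larger $N$ packs a longer time span into each frame at the cost of a wider range of pixel values, which is the trade-off the bit-depth analysis examines.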
Abstract
Recognizing faces and their underlying emotions is an important aspect of biometrics, and estimating emotional states from faces has been tackled from several angles in the literature. In this paper, we follow the novel route of using neuromorphic data to predict valence and arousal values from faces. Because gathering annotated event-based videos is difficult, we leverage an event camera simulator to create the neuromorphic counterpart of an existing RGB dataset. We demonstrate not only that models trained on simulated data can still yield state-of-the-art results in valence-arousal estimation, but also that our trained models can be applied directly to real data, without further training, to address the downstream task of emotion recognition. We propose several alternative models to solve the task, both frame-based and video-based.
