FindingEmo: An Image Dataset for Emotion Recognition in the Wild
Laurent Mertens, Elahe' Yargholi, Hans Op de Beeck, Jan Van den Stock, Joost Vennekens
TL;DR
FindingEmo is a large-scale image dataset for emotion recognition in natural social scenes. Entire images, showing scenes with multiple people, are annotated with Valence in $[-3,3]$, Arousal in $[0,6]$, and Plutchik-based Emo8/Emo24 labels. The dataset comprises 25,869 public and 1,525 private images, collected in a two-phase process (image scraping and Prolific-based annotation) spanning 51 runs and 655 annotators, at a total cost of roughly £10k. Baseline experiments with ImageNet models, EmoNet, CLIP, and DINOv2 show that the task is challenging: Arousal is harder to predict than Valence, CNNs sometimes outperform ViTs on discrete emotion classification, and late fusion offers modest gains. Beyond-baseline analyses show that facial-emotion cues substantially boost performance, whereas some streams, such as CLIP/DINOv2, yield only limited improvements. This underscores the complexity of scene-level emotion recognition and the need for novel modeling approaches. The work also discusses reliability, biases, and ethical implications, and provides open-source code and interfaces to foster further research on higher-order social cognition in the wild.
Abstract
We introduce FindingEmo, a new image dataset containing annotations for 25k images, specifically tailored to Emotion Recognition. Unlike existing datasets, it focuses on complex scenes depicting multiple people in various naturalistic, social settings, with each image annotated as a whole, thereby going beyond the traditional focus on faces or single individuals. Annotated dimensions include Valence, Arousal, and Emotion label, with annotations gathered using Prolific. Together with the annotations, we release the list of URLs pointing to the original images, as well as all associated source code.
