FindingEmo: An Image Dataset for Emotion Recognition in the Wild

Laurent Mertens; Elahe' Yargholi; Hans Op de Beeck; Jan Van den Stock; Joost Vennekens

TL;DR

FindingEmo introduces a large-scale image dataset for emotion recognition in natural social scenes. Images depicting multiple people are annotated as a whole with Valence in $[-3,3]$ and Arousal in $[0,6]$, alongside Plutchik-based Emo8/Emo24 labels. The dataset comprises 25,869 public and 1,525 private images, collected in a two-phase process (image scraping, followed by Prolific-based annotation) across 51 runs by 655 annotators, at a total cost of roughly £10k. Baseline experiments using ImageNet models, EmoNet, CLIP, and DINOv2 show that the task is challenging: Arousal is harder to predict than Valence, CNNs sometimes outperform ViTs on discrete emotion classification, and late fusion offers only modest gains. Beyond-baseline analyses show that facial-emotion cues substantially boost performance, while other streams such as CLIP/DINOv2 provide limited improvements, underscoring the complexity of scene-level emotion recognition and the need for novel modeling approaches. The work also discusses reliability, biases, and ethical implications, and provides open-source code and interfaces to foster further research on higher-order social cognition in the wild.

Abstract

We introduce FindingEmo, a new image dataset containing annotations for 25k images, specifically tailored to Emotion Recognition. Contrary to existing datasets, it focuses on complex scenes depicting multiple people in various naturalistic, social settings, with images being annotated as a whole, thereby going beyond the traditional focus on faces or single individuals. Annotated dimensions include Valence, Arousal and Emotion label, with annotations gathered using Prolific. Together with the annotations, we release the list of URLs pointing to the original images, as well as all associated source code.
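To make the annotation dimensions concrete, the following is a minimal sketch of what a single whole-image annotation record could look like. The class name `ImageAnnotation`, its field names, and the use of integer scores are assumptions made for illustration, not the dataset's actual schema; the Emo8 label set is assumed to be Plutchik's eight primary emotions.

```python
from dataclasses import dataclass

# Plutchik's eight primary emotions, assumed here to be the Emo8 label set.
EMO8 = ("joy", "trust", "fear", "surprise",
        "sadness", "disgust", "anger", "anticipation")


@dataclass(frozen=True)
class ImageAnnotation:
    """Hypothetical record for one annotation of an entire image."""
    image_url: str
    valence: int  # assumed integer score in [-3, 3]
    arousal: int  # assumed integer score in [0, 6]
    emo8: str     # one of the EMO8 labels above

    def __post_init__(self):
        # Reject values outside the ranges stated in the summary.
        if not -3 <= self.valence <= 3:
            raise ValueError(f"valence {self.valence} outside [-3, 3]")
        if not 0 <= self.arousal <= 6:
            raise ValueError(f"arousal {self.arousal} outside [0, 6]")
        if self.emo8 not in EMO8:
            raise ValueError(f"unknown Emo8 label: {self.emo8!r}")


# Example with a placeholder URL (not a real dataset entry).
example = ImageAnnotation("https://example.org/img.jpg",
                          valence=2, arousal=4, emo8="joy")
```

Validating at construction time keeps out-of-range scores from silently entering downstream regression targets.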

Paper Structure (43 sections, 4 equations, 20 figures, 6 tables)

Introduction
Dataset Description
Valence and Arousal
Emotion
Positioning Versus Existing Datasets
Dataset Creation Process
Phase 1
Phase 2
Annotator Grading and Annotator Overlap
Statistics and Observations
Baseline Model Results
Beyond the Baseline
Discussion
Findings
Limitations
...and 28 more sections

Figures (20)

Figure 1: An image from the FindingEmo dataset. Photo courtesy The Kitcheners (https://thekitcheners.co.uk/).
Figure 2: Plutchik's Wheel of Emotions.
Figure 3: Distribution of Emotion annotations for the public set per Plutchik emotion leaf.
Figure 4: Association between Valence and Arousal values. The bigger the disc, the more often the (Valence, Arousal)-pair appears in the dataset.
Figure 5: Test data baseline performance on the Emo8 classification and Arousal and Valence regression tasks. Metrics are: Weighted F1 (W.F1) and Average Precision (AP) for classification, and Mean Absolute Error (MAE) and Spearman R correlation coefficient (S.R) for regression. The starting learning rate and loss corresponding to each model are displayed above the training bars. (U)CE = (Unbalanced)CrossEntropyLoss, (W)MSE = (Weighted)MeanSquaredError loss, p365 = original model trained on Places365 dataset.
...and 15 more figures
