EmoCAM: Toward Understanding What Drives CNN-based Emotion Recognition

Youssef Doulfoukar; Laurent Mertens; Joost Vennekens

TL;DR

This work addresses the challenge of explaining CNN-based image emotion recognition by introducing EmoCAM, a corpus-level framework that combines Class Activation Maps with Open Images object detection to identify which object classes most influence EmoNet's emotion predictions. The method builds a joint object-emotion association matrix and compares CAM techniques via Representational Similarity Analysis, revealing broad agreement across methods. Key findings show EmoNet heavily relies on human features, particularly faces, but is highly sensitive to object insertions that can drastically shift predicted emotions, exposing potential biases and robustness concerns. The contributions yield a scalable explainability pipeline for emotion recognition, with practical implications for dataset design, bias mitigation, and model evaluation in affective computing.
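The joint object-emotion association matrix can be illustrated with a minimal sketch. It assumes each image has already been reduced to a predicted emotion label plus a list of detected object classes with an importance score. The `records` layout, the `association_matrix` name, and the 0.5 threshold are hypothetical choices for illustration, not the authors' implementation.

```python
from collections import defaultdict


def association_matrix(records, threshold=0.5):
    """Percentage of images per emotion with >= 1 important object of each class.

    records: iterable of (emotion_label, [(object_class, importance), ...]).
    threshold: hypothetical importance cut-off (the source does not state one).
    Returns a dict mapping (emotion_label, object_class) to a percentage.
    """
    images_per_emotion = defaultdict(int)
    hits = defaultdict(int)  # (emotion, object_class) -> number of images

    for emotion, detections in records:
        images_per_emotion[emotion] += 1
        # Use a set so each class counts at most once per image.
        important = {cls for cls, score in detections if score >= threshold}
        for cls in important:
            hits[(emotion, cls)] += 1

    return {
        (emotion, cls): 100.0 * count / images_per_emotion[emotion]
        for (emotion, cls), count in hits.items()
    }


# Toy data, invented purely for illustration.
records = [
    ("Joy", [("Human face", 0.9), ("Ball", 0.2)]),
    ("Joy", [("Human face", 0.7), ("Human face", 0.8)]),
    ("Excitement", [("Ball", 0.6)]),
    ("Joy", [("Tree", 0.1)]),
]
matrix = association_matrix(records)
```

In this toy example, two of the three "Joy" images contain an important "Human face" detection, so that entry is about 66.7%. Pairs that never co-occur above the threshold are simply absent from the result.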

Abstract

Convolutional Neural Networks are particularly suited for image analysis tasks, such as Image Classification, Object Recognition or Image Segmentation. Like all Artificial Neural Networks, however, they are "black box" models, and suffer from poor explainability. This work is concerned with the specific downstream task of Emotion Recognition from images, and proposes a framework that combines CAM-based techniques with Object Detection on a corpus level to better understand on which image cues a particular model, in our case EmoNet, relies to assign a specific emotion to an image. We demonstrate that the model mostly focuses on human characteristics, but also explore the pronounced effect of specific image modifications.
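One way to combine a CAM with Object Detection output is to score each detected bounding box by how much of the map's activation falls inside it. The sketch below uses the mean activation inside the box relative to the map's peak. This scoring rule, the `object_importance` name, and the `(x0, y0, x1, y1)` pixel box format are assumptions made for illustration; the abstract does not specify the measure actually used.

```python
import numpy as np


def object_importance(cam, box):
    """Mean CAM activation inside a box, relative to the CAM's maximum.

    cam: 2D array (height x width) of non-negative activations.
    box: (x0, y0, x1, y1) in pixel coordinates, with x1 and y1 exclusive.
    Returns 0.0 for an empty box or a map without positive activation.
    """
    x0, y0, x1, y1 = box
    region = cam[y0:y1, x0:x1]
    peak = cam.max()
    if region.size == 0 or peak <= 0:
        return 0.0
    return float(region.mean() / peak)


# Toy CAM: all activation sits in the top-left 2x2 corner.
cam = np.zeros((4, 4))
cam[0:2, 0:2] = 1.0

inside = object_importance(cam, (0, 0, 2, 2))    # box covers the hot region
outside = object_importance(cam, (2, 2, 4, 4))   # box covers only zeros
whole = object_importance(cam, (0, 0, 4, 4))     # box covers the full image
```

A score like this can then be thresholded to decide whether an object counts as important for the image's predicted emotion.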

Paper Structure (8 sections, 7 figures)

Introduction
Methodology
Results
Results for Grad-CAM
Comparison of CAM Methods
Prediction Stability
Limitations and Future Work
Conclusion

Figures (7)

Figure 1: Schematic illustrating the pipeline that combines CAM with Object Detection. Photo by Sander Sammy on Unsplash, https://tinyurl.com/2tc69hfy, Unsplash license (https://unsplash.com/license).
Figure 2: Association between Open Images classes and predicted EmoNet label. Heatmap entries represent the percentage of images labeled with a certain EmoNet label for which at least one object of the corresponding Open Images class was detected with high enough importance. "Aest. Appr." = Aesthetic Appreciation.
Figure 3: Representational Similarity Analysis (RSA) of different CAM methods.
Figure 4: Adversarial example. Original image on the left, labeled by EmoNet as 92.9% "Joy". Modified part on the right; full modified image labeled as 66.1% "Excitement". Sources: original photo by istolethetv, https://tinyurl.com/3r86e3w2, CC 2.0 license; Rugby ball by Peter Griffin, https://tinyurl.com/dmb77rks, CC0 license.
Figure 5: Schematic illustration of the "paste object in image" experiment. The grid on the left shows the relative positions within the image at which the objects are pasted and centered; the objects considered are shown on the right. Sources: Rugby ball, see Figure 4; Soccer ball by Jean Schecter, https://tinyurl.com/3xxkkysb, CC BY-NC 4.0; Lotus flower at https://pngimg.com/image/69752, CC BY-NC 4.0.
...and 2 more figures
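
The comparison of CAM methods via Representational Similarity Analysis (Figure 3) can be sketched as follows. For each method, a representational dissimilarity matrix (RDM) is built over the same set of images, and two methods are compared by correlating the upper triangles of their RDMs. The function names, the use of 1 minus Pearson correlation as the dissimilarity, and the Pearson comparison of RDMs are assumptions for illustration (RSA is often done with Spearman correlation instead); the data is random.

```python
import numpy as np


def rdm(maps):
    """RDM over images: 1 - Pearson correlation between flattened maps.

    maps: array of shape (n_images, height, width) for one CAM method.
    """
    flat = np.asarray(maps, dtype=float).reshape(len(maps), -1)
    return 1.0 - np.corrcoef(flat)


def rsa_similarity(maps_a, maps_b):
    """Pearson correlation between the upper triangles of two RDMs.

    Both inputs must cover the same images in the same order.
    """
    rdm_a, rdm_b = rdm(maps_a), rdm(maps_b)
    upper = np.triu_indices_from(rdm_a, k=1)  # skip the zero diagonal
    return float(np.corrcoef(rdm_a[upper], rdm_b[upper])[0, 1])


# Random stand-ins for the maps of three CAM methods on five images.
rng = np.random.default_rng(0)
maps_a = rng.random((5, 4, 4))
maps_b = 2.0 * maps_a + 1.0       # rescaled copy: same representational geometry
maps_c = rng.random((5, 4, 4))    # unrelated maps

same_geometry = rsa_similarity(maps_a, maps_b)
unrelated = rsa_similarity(maps_a, maps_c)
```

Because the dissimilarity is correlation-based, a method whose maps are only rescaled versions of another's yields an RSA similarity of 1 (up to floating-point error), while unrelated maps give a value somewhere between -1 and 1.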
