EmoCAM: Toward Understanding What Drives CNN-based Emotion Recognition
Youssef Doulfoukar, Laurent Mertens, Joost Vennekens
TL;DR
This work addresses the challenge of explaining CNN-based image emotion recognition by introducing EmoCAM, a corpus-level framework that combines Class Activation Maps with Open Images object detection to identify which object classes most influence EmoNet's emotion predictions. The method builds a joint object-emotion association matrix and compares CAM techniques via Representational Similarity Analysis, revealing broad agreement across methods. Key findings show EmoNet heavily relies on human features, particularly faces, but is highly sensitive to object insertions that can drastically shift predicted emotions, exposing potential biases and robustness concerns. The contributions yield a scalable explainability pipeline for emotion recognition, with practical implications for dataset design, bias mitigation, and model evaluation in affective computing.
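As a rough illustration of the Representational Similarity Analysis step, the sketch below compares two object-emotion association matrices produced by two different CAM techniques. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`rdm`, `rank`, `rsa_score`), the choice of 1 minus Pearson correlation as the dissimilarity, the rank-based comparison of upper triangles, and the random toy matrices are all hypothetical.

```python
import numpy as np

def rdm(assoc):
    """Representational dissimilarity matrix over emotions.

    assoc has shape (n_object_classes, n_emotions); each emotion is
    described by its column of object associations. Dissimilarity is
    assumed here to be 1 - Pearson correlation between columns.
    """
    return 1.0 - np.corrcoef(assoc.T)

def rank(values):
    """Ordinal ranks; ties are broken by position, not averaged."""
    return np.argsort(np.argsort(values))

def rsa_score(assoc_a, assoc_b):
    """Rank correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(assoc_a.shape[1], k=1)
    a, b = rdm(assoc_a)[iu], rdm(assoc_b)[iu]
    return np.corrcoef(rank(a), rank(b))[0, 1]

# Toy data: 20 object classes, 5 emotions (values are random, not results).
rng = np.random.default_rng(0)
assoc_method_a = rng.random((20, 5))
assoc_method_b = assoc_method_a + 0.01 * rng.random((20, 5))

score_same = rsa_score(assoc_method_a, assoc_method_a)
score_close = rsa_score(assoc_method_a, assoc_method_b)
```

A score near 1 would indicate that the two CAM techniques induce a similar structure over emotions; comparing a matrix with itself gives the upper bound.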
Abstract
Convolutional Neural Networks are particularly well suited to image analysis tasks such as Image Classification, Object Recognition, and Image Segmentation. Like all Artificial Neural Networks, however, they are "black box" models and suffer from poor explainability. This work addresses the specific downstream task of Emotion Recognition from images and proposes a framework that combines CAM-based techniques with Object Detection at the corpus level to better understand which image cues a particular model, in our case EmoNet, relies on when assigning a specific emotion to an image. We demonstrate that the model focuses mostly on human characteristics, and we also explore the pronounced effect of specific image modifications.
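To make the corpus-level combination of CAMs and Object Detection concrete, the sketch below shows one plausible way to accumulate an object-emotion count matrix. It is a minimal sketch under stated assumptions, not the paper's procedure: the function name `update_association`, the rule that a detected object counts when the mean activation inside its bounding box reaches a threshold, and the threshold value itself are hypothetical.

```python
import numpy as np

def update_association(assoc, cam, emotion_idx, detections, threshold=0.5):
    """Add one image's evidence to the object-emotion count matrix.

    assoc:       (n_object_classes, n_emotions) integer array, updated in place
    cam:         (H, W) activation map scaled to [0, 1]
    emotion_idx: index of the emotion predicted for this image
    detections:  iterable of (object_class_idx, (x0, y0, x1, y1)) in pixels
    threshold:   hypothetical cut-off on the mean activation inside a box
    """
    for obj_idx, (x0, y0, x1, y1) in detections:
        region = cam[y0:y1, x0:x1]
        # Count the object only if the CAM is, on average, active over it.
        if region.size and region.mean() >= threshold:
            assoc[obj_idx, emotion_idx] += 1
    return assoc

# Toy image: the CAM is active only in the top-left quadrant.
cam = np.zeros((4, 4))
cam[:2, :2] = 1.0

assoc = np.zeros((3, 2), dtype=int)       # 3 object classes, 2 emotions
detections = [(0, (0, 0, 2, 2)),          # object 0 lies inside the active area
              (2, (2, 2, 4, 4))]          # object 2 lies outside it
update_association(assoc, cam, emotion_idx=1, detections=detections)
```

Repeating this update over every image in a corpus yields a matrix whose rows can be ranked to see which object classes most often coincide with the regions driving each predicted emotion.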
