Interpretable Image Emotion Recognition: A Domain Adaptation Approach Using Facial Expressions

Puneet Kumar, Balasubramanian Raman

TL;DR

The paper addresses Image Emotion Recognition (IER) with a feature-based domain adaptation framework that transfers a Facial Expression Recognition (FER) model to generic images using a discrepancy loss while keeping the network architecture unchanged. It introduces DnCShap, a Divide and Conquer based SHAP interpretability method that provides pixel-level and layer-wise explanations of emotion predictions. On four datasets (IAPSa, ArtPhoto, FI, EMOTIC) the system reaches accuracies of 61.86%, 62.47%, 70.78%, and 59.72%, respectively, and ablations validate the contributions of both the domain adaptation and the interpretability components. The resulting IER pipeline handles both facial and non-facial content and lays groundwork for future multimodal extensions and broader emotion taxonomies.

Abstract

This paper proposes a feature-based domain adaptation technique for identifying emotions in generic images, encompassing both facial and non-facial objects, as well as non-human components. This approach addresses the limited availability of pre-trained models and well-annotated datasets for Image Emotion Recognition (IER). First, a deep-learning-based Facial Expression Recognition (FER) system is developed that classifies facial images into discrete emotion classes. Keeping the same network architecture, this FER system is then adapted to recognize emotions in generic images by applying a discrepancy loss, enabling the model to learn IER features while classifying emotions into categories such as 'happy,' 'sad,' 'hate,' and 'anger.' Additionally, a novel interpretability method, Divide and Conquer based Shap (DnCShap), is introduced to identify the visual features most relevant for emotion recognition. The proposed IER system achieved emotion classification accuracies of 61.86% on the IAPSa dataset, 62.47% on the ArtPhoto dataset, 70.78% on the FI dataset, and 59.72% on the EMOTIC dataset. The system identifies the visual features that lead to specific emotion classifications and also provides detailed embedding plots explaining the predictions, enhancing understanding of and trust in AI-driven emotion recognition systems.
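
The abstract does not state the exact form of the discrepancy loss, so the following is only a minimal sketch of the general idea: a classification loss on IER labels combined with a penalty on the gap between FER (source) and IER (target) feature statistics. It assumes a linear maximum mean discrepancy (squared distance between mean feature vectors) as the discrepancy term; the function names and the trade-off weight `lam` are hypothetical and do not come from the paper.

```python
import numpy as np


def linear_mmd(source_feats, target_feats):
    """Squared distance between the mean feature vectors of two domains.

    Assumed stand-in for the paper's discrepancy loss, not its actual definition.
    """
    delta = source_feats.mean(axis=0) - target_feats.mean(axis=0)
    return float(delta @ delta)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def total_loss(logits, labels, source_feats, target_feats, lam=0.5):
    """Classification loss plus weighted discrepancy term.

    lam is a hypothetical trade-off weight, not a value reported in the paper.
    """
    return cross_entropy(logits, labels) + lam * linear_mmd(source_feats, target_feats)
```

In such a setup, `source_feats` would be features the network produces for facial images and `target_feats` features for generic images; minimizing the combined loss pulls the two feature distributions together while the classifier learns the IER labels.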

Paper Structure

This paper contains 31 sections, 13 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: The proposed FER system's architecture contains FER, residual, and interpretability modules. The input is processed through the FER module's layers B1 to B5, followed by the residual module's parallel branches R1 to R4. The interpretability module uses DnCShap to analyze pixel relevance for emotion recognition.
  • Figure 2: Architecture of the proposed IER system, adapted from the pre-trained FER model. The convolutional blocks B1 to B5 and residual modules facilitate the feature adaptation necessary for handling generic images in the IER model. The integration of discrepancy loss demonstrates how domain adaptation techniques optimize the model for diverse emotional cues. The Classification Loss ensures the accurate categorization of emotional states, enhancing the overall performance and effectiveness of the model.
  • Figure 3: Illustration of Shapley value calculation. Nodes (Node1 to Node4) represent different configurations of feature combinations where Node1 has no features, and subsequent nodes incrementally include features $f_1$ and $f_2$ while $w_{pq}$ denotes the weight coefficient for node pair $Node_p$-$Node_q$.
  • Figure 4: Confusion matrices for IAPSa, ArtPhoto, FI, and EMOTIC datasets.
  • Figure 5: Feature-wise Interpretability for various emotion classes. Areas marked in red highlight the most significant visual features contributing to emotion recognition.
  • ...and 1 more figure
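
The Shapley value calculation illustrated in Figure 3 can be made concrete with a small example. The sketch below computes exact Shapley values using the standard textbook definition over all feature subsets; it is not the paper's DnCShap procedure, and whether its subset weights coincide with the $w_{pq}$ coefficients in the figure is an assumption. The toy model outputs in `outputs` are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a set function `value` over `features`.

    `value` takes a frozenset of feature names and returns the model output
    when only those features are present.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for a subset of size k that excludes f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of f when added to subset s.
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


# Toy two-feature example with four configurations, mirroring the four nodes
# of Figure 3; the output values are invented for illustration.
outputs = {
    frozenset(): 0.1,
    frozenset({"f1"}): 0.4,
    frozenset({"f2"}): 0.3,
    frozenset({"f1", "f2"}): 0.9,
}
phi = shapley_values(["f1", "f2"], lambda s: outputs[s])
print(phi)  # roughly f1: 0.45, f2: 0.35
```

The two values sum to the difference between the output with all features and the output with none (0.9 - 0.1 = 0.8), which is the efficiency property that makes Shapley values suitable for attributing a prediction to individual features.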