Explainable Emotion Decoding for Human and Computer Vision

Alessio Borriero, Martina Milazzo, Matteo Diano, Davide Orsenigo, Maria Chiara Villa, Chiara Di Fazio, Marco Tamietto, Alan Perotti

TL;DR

The paper presents a parallel analysis of emotion decoding in human and computer vision using the StudyForrest dataset. LIME and SHAP are applied to fMRI-based brain decoders, and pixel-level explanations are generated for a CNN that takes movie frames as input. The paper reports strong emotion and face decoding performance in both modalities. The explainability maps reveal a core, hierarchical brain network involving regions such as the OFC and insula, while frame saliency aligns with human eye movements mainly for faces. The cross-domain analysis links CNN attention with gaze data to explore neural correlates of attention, supporting a constructionist view of emotion processing and suggesting that future RSA-based analyses could further integrate neuroscience and ML. Overall, the work provides a framework for interpretable emotion decoding across biological and artificial vision systems, with implications for both neuroscience and AI research.

Abstract

Modern Machine Learning (ML) has significantly advanced various research fields, but the opaque nature of ML models hinders their adoption in several domains. Explainable AI (XAI) addresses this challenge by providing additional information to help users understand the internal decision-making process of ML models. In the field of neuroscience, enriching an ML model for brain decoding with attribution-based XAI techniques means being able to highlight which brain areas correlate with the task at hand, thus offering valuable insights to domain experts. In this paper, we analyze human and Computer Vision (CV) systems in parallel, training and explaining two ML models based respectively on functional Magnetic Resonance Imaging (fMRI) and movie frames. We do so by leveraging the "StudyForrest" dataset, which includes fMRI scans of subjects watching the "Forrest Gump" movie, emotion annotations, and eye-tracking data. For human vision, the ML task is to link fMRI data with emotional annotations, and the explanations highlight the brain regions strongly correlated with the label. For computer vision, on the other hand, the input data are movie frames, and the explanations are pixel-level heatmaps. We cross-analyzed our results, linking human attention (obtained through eye-tracking) with XAI saliency on CV models and brain region activations. We show how a parallel analysis of human and computer vision can provide useful information for both the neuroscience community (allocation theory) and the ML community (biological plausibility of convolutional models).

Paper Structure (22 sections, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Time series of emotion annotations. The work by Lettieri et al. provides emotion annotations of the whole Forrest Gump movie by 12 independent human annotators; in our work we focused on happiness, fear, sadness, and anger.
  • Figure 2: Emotion decoding and XAI pipelines for brain data (A) and computer vision (B).
  • Figure 3: Brain-wise feature importance maps obtained with SHAP and LIME. Through a null model we assess the significance of each area, obtaining a limited set of regions which process most information about the emotional content of the movie.
  • Figure 4: Correlation among brain maps related to different models. The high correlation values we observed are due to the existence of a common brain network which processes information about the emotional content of a multisensory input. All the resulting correlations have strong statistical significance, with p-values always below 0.0002.
  • Figure 5: Area-wise correlation between the brain explanations and the overlap score.
  • ...and 1 more figures
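
The Figure 3 caption mentions assessing the significance of each brain area through a null model. A minimal sketch of that general idea, on synthetic data, is given below. It is not the authors' pipeline: the ridge decoder, the label-permutation null, the region count, and all function names are assumptions made for illustration. The attribution used here, w_i * (x_i - mean_i), is what Shapley values reduce to for a linear model when features are treated as independent, so it stands in for the SHAP and LIME maps described in the paper.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge weights; X and y are expected to be centered."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def region_importance(X, y, alpha=1.0):
    """Mean absolute linear attribution |w_i * (x_i - mean_i)| per region."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = fit_ridge(Xc, yc, alpha)
    return np.abs(Xc * w).mean(axis=0)

def null_model_pvalues(X, y, n_perm=200, seed=0):
    """Compare observed importances with a null built by shuffling labels."""
    rng = np.random.default_rng(seed)
    observed = region_importance(X, y)
    null = np.stack(
        [region_importance(X, rng.permutation(y)) for _ in range(n_perm)]
    )
    # Add-one correction keeps p-values strictly positive.
    pvals = (1 + (null >= observed).sum(axis=0)) / (n_perm + 1)
    return observed, pvals

# Synthetic stand-in for region-averaged fMRI features (hypothetical sizes):
# 300 time points, 10 regions, with only regions 0 and 3 carrying signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=300)

importance, pvals = null_model_pvalues(X, y)
significant = np.flatnonzero(pvals < 0.05)
print("significant regions:", significant)
```

Shuffling the labels breaks the link between features and annotations while preserving the feature distribution, so a region counts as significant only if its observed importance exceeds what the decoder assigns by chance. Real fMRI time series are autocorrelated, so a plain permutation like this one would likely be too liberal in practice.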