VISTANet: VIsual Spoken Textual Additive Net for Interpretable Multimodal Emotion Recognition

Puneet Kumar, Sarthak Malik, Balasubramanian Raman, Xiaobai Li

TL;DR

VISTANet classifies emotions from image, speech, and text inputs using a hybrid of intermediate and late fusion with learned per-sample weights. The accompanying KAAP interpretability method quantifies how much each modality, and each feature within it, contributes to a predicted emotion. The IIT-R MMEmoRec dataset covers diverse image types and is split into Set A (high-confidence data) and Set B (lower-confidence data) to probe robustness. VISTANet reaches 95.99% on Set A, 75.13% on Set B, and 80.11% overall, while KAAP produces per-sample explanations faster than SHAP. The modality contributions are analyzed across emotions and tasks, including sentiment classification on BT4SA.

Abstract

This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTANet), to classify emotions reflected by input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has been developed to identify the important visual, spoken, and textual features that lead to the prediction of a particular emotion class. VISTANet fuses information from the image, speech, and text modalities using a hybrid of intermediate and late fusion, automatically adjusting the weights of their intermediate outputs while computing the weighted average. The KAAP technique computes the contribution of each modality and its corresponding features toward predicting a particular emotion class. To mitigate the scarcity of multimodal emotion datasets labelled with discrete emotion classes, we have constructed the IIT-R MMEmoRec dataset consisting of images, corresponding speech and text, and emotion labels ('angry,' 'happy,' 'hate,' and 'sad'). VISTANet achieves an overall emotion recognition accuracy of 80.11% on the IIT-R MMEmoRec dataset using the visual, spoken, and textual modalities, outperforming single- and dual-modality configurations. The code and data can be accessed at https://github.com/MIntelligence-Group/MMEmoRec.
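
The weighted-average step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a softmax to normalize the weights, the three-branch layout, and all numbers are assumptions made for illustration only. In a trained model the raw weights would be learned rather than fixed by hand.

```python
import math

EMOTIONS = ["angry", "happy", "hate", "sad"]  # class labels from the abstract


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_late_fusion(branch_probs, raw_weights):
    """Weighted average of per-branch class probabilities.

    branch_probs: one probability vector per branch (hypothetical outputs).
    raw_weights:  one unconstrained score per branch (hypothetical; these
                  would be learned parameters in a real model).
    """
    weights = softmax(raw_weights)  # non-negative and summing to 1
    n_classes = len(branch_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, branch_probs))
        for c in range(n_classes)
    ]
    return fused, weights


# Illustrative numbers only, not results from the paper.
branch_probs = [
    [0.10, 0.70, 0.05, 0.15],  # image branch
    [0.20, 0.50, 0.10, 0.20],  # speech branch
    [0.05, 0.80, 0.05, 0.10],  # text branch
]
raw_weights = [0.2, -0.5, 1.0]

fused, weights = weighted_late_fusion(branch_probs, raw_weights)
predicted = EMOTIONS[fused.index(max(fused))]
print(predicted, [round(p, 3) for p in fused])
```

Because the normalized weights sum to one and each branch output is a probability vector, the fused vector is again a probability distribution, so the predicted class is simply its largest entry.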

Paper Structure (33 sections, 10 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Example of emotion label determination.
  • Figure 2: Determining threshold confidence for dataset construction.
  • Figure 3: Schematic architecture of the proposed multimodal emotion recognition system. Here, $\mathbf{P_m}$ and $\mathbf{S_m}$ denote the pre-trained and simpler networks for the $m^{\text{th}}$ modality, whereas 'i,' 's,' and 't' denote the visual, speech, and text modalities, respectively. The first six blocks represent pairwise intermediate fusion, and the final block illustrates the late fusion.
  • Figure 4: Schematic representation of the proposed interpretability technique. The symbols $k_i$, $k_s$, and $k_t$ represent the number of image, speech, and text partitions; $w_i$ and $w_s$ are the widths of the image and speech feature matrices; and $L_t$ is the length of the text feature vector.
  • Figure 5: Sample model for KP values computation.
  • ...and 2 more figures