
Unveiling Hidden Factors: Explainable AI for Feature Boosting in Speech Emotion Recognition

Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara

TL;DR

This paper addresses speech emotion recognition (SER) with an iterative feature-boosting framework that foregrounds feature relevance and explainability. It combines a feature boosting module (guided by the Variance Ratio Criterion and PCA), a classification module, and a SHAP-based explainability module in a feedback loop that refines the feature set over successive iterations. Experiments on EMO-DB, TESS, RAVDESS, and SAVEE show state-of-the-art performance, including results on TESS that surpass human-level performance. The work highlights the practical value of integrating explainable AI into SER to obtain accurate, transparent emotion recognition that holds up across several benchmarks.
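The TL;DR names two tools in the feature boosting module, the Variance Ratio Criterion (VRC, also known as the Calinski–Harabasz index) and PCA, but does not say exactly how they interact. The sketch below shows one plausible arrangement and is not the authors' implementation: each candidate feature combination is projected onto its top principal components, and the VRC measures how well the emotion classes separate in that space. The function `score_combinations`, the combination names, and the toy data are hypothetical.

```python
import numpy as np


def variance_ratio_criterion(X, labels):
    """Calinski-Harabasz variance ratio:
    (between-class dispersion / (k - 1)) / (within-class dispersion / (n - k))."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n, k = X.shape[0], len(classes)
    if not 1 < k < n:
        raise ValueError("need at least 2 classes and more samples than classes")
    grand_mean = X.mean(axis=0)
    between = within = 0.0
    for c in classes:
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - grand_mean) ** 2)
        within += np.sum((Xc - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))


def pca_project(X, n_components):
    """Project centred data onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def score_combinations(X, labels, combinations, n_components=2):
    """Hypothetical helper: VRC of each feature combination in its PCA space."""
    return {
        name: variance_ratio_criterion(pca_project(X[:, cols], n_components), labels)
        for name, cols in combinations.items()
    }


# Toy data (assumption): 3 "emotion" classes, 30 utterances each, 6 features.
# Columns 0-2 carry class information; columns 3-5 are pure noise.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 6))
X[:, :3] += 3.0 * labels[:, None]

combinations = {"informative": [0, 1, 2], "mixed": [0, 3, 4], "noise": [3, 4, 5]}
scores = score_combinations(X, labels, combinations)
best = max(scores, key=scores.get)
print(best, {name: f"{s:.1f}" for name, s in scores.items()})
```

A higher VRC means tighter, better-separated emotion clusters, so under this reading the combination with the largest score would be carried forward to the classification module.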

Abstract

Speech emotion recognition (SER) has gained significant attention owing to its many applications in fields such as mental health, education, and human-computer interaction. However, the accuracy of SER systems is hindered by high-dimensional feature sets that may contain irrelevant and redundant information. To overcome this challenge, this study proposes an iterative feature boosting approach for SER that emphasizes feature relevance and explainability to enhance machine learning model performance. Our approach relies on careful feature selection and analysis to build efficient SER systems. To address this problem through model explainability, we employ a feature evaluation loop that uses Shapley values to iteratively refine the feature set. This process balances model performance and transparency and provides a detailed understanding of the model's predictions. The proposed approach offers several advantages, including the identification and removal of irrelevant and redundant features, which yields a more effective model. It also promotes explainability, making the model's predictions easier to interpret and revealing the features most important for determining emotion. The effectiveness of the proposed method is validated on four SER benchmarks, namely the Toronto Emotional Speech Set (TESS), the Berlin Database of Emotional Speech (EMO-DB), the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset, on which it outperforms state-of-the-art methods. To the best of our knowledge, this is the first work to incorporate model explainability into an SER framework. The source code of this paper is publicly available at https://github.com/alaaNfissi/Unveiling-Hidden-Factors-Explainable-AI-for-Feature-Boosting-in-Speech-Emotion-Recognition.
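The feature evaluation loop driven by Shapley values can be illustrated with a small, self-contained sketch. It rests on several assumptions that differ from the paper: Shapley values are computed exactly by enumerating feature coalitions (features outside a coalition are set to their mean) instead of using the SHAP library, a least-squares linear scorer stands in for the paper's classifier, and the refinement rule simply drops the feature with the lowest mean |Shapley value| until a hypothetical `n_keep` features remain. `exact_shapley`, `fit_linear`, and `shap_feedback_loop` are illustrative names, not the authors' API.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(model, x, baseline):
    """Exact Shapley values for one sample x (O(2^d) model calls per feature).
    Features outside a coalition are replaced by the baseline values."""
    d = len(x)
    phi = np.zeros(d)

    def value(subset):
        z = baseline.copy()
        idx = np.array(subset, dtype=int)
        z[idx] = x[idx]
        return model(z)

    for i in range(d):
        others = [j for j in range(d) if j != i]
        for size in range(d):
            # Shapley weight |S|! (d - |S| - 1)! / d! for coalitions S without i
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def fit_linear(X, y):
    """Least-squares linear scorer with a bias term (stand-in classifier)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda z: float(np.dot(z, coef[:-1]) + coef[-1])


def shap_feedback_loop(X, y, n_keep):
    """Refit, explain, and drop the least important feature until n_keep remain."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        Xk = X[:, kept]
        model = fit_linear(Xk, y)
        baseline = Xk.mean(axis=0)
        importance = np.mean(
            [np.abs(exact_shapley(model, x, baseline)) for x in Xk], axis=0
        )
        kept.pop(int(np.argmin(importance)))
    return kept


# Toy data (assumption): binary labels depend only on features 0 and 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = np.sign(2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.normal(size=200))

kept = shap_feedback_loop(X, y, n_keep=2)
print("kept features:", kept)
```

Exact enumeration grows exponentially with the number of features, so it only suits a handful of features; realistic SER feature sets with hundreds of descriptors call for an approximate or model-specific explainer such as those in the SHAP library. For a linear scorer, each value reduces to w_i * (x_i - mean_i), which makes the sketch easy to check by hand.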

Paper Structure (28 sections, 8 equations, 13 figures, 10 tables, 1 algorithm)

Figures (13)

  • Figure 1: The proposed method and its main modules: (a) feature boosting module, (b) classification module, (c) model explainability module. FC stands for feature combination.
  • Figure 2: Importance of the boosted features to the model's predictions
  • Figure 3: Importance of the initial features to the model's predictions
  • Figure 4: Biplot of the second feature combination (EMO-DB)
  • Figure 5: Biplot of the optimal feature combination (EMO-DB)
  • ...and 8 more figures