A Systematic Evaluation of Adversarial Attacks against Speech Emotion Recognition Models

Nicolas Facchinetti, Federico Simonetta, Stavros Ntalampiras

TL;DR

This paper investigates the robustness of Speech Emotion Recognition (SER) models to adversarial attacks across three languages and gendered speech. It deploys a uniform CNN-LSTM architecture trained on log-Mel spectrograms and evaluates seven attack algorithms from the ART framework, spanning white-box and black-box settings. The results show substantial vulnerability of SER to adversarial examples, with the Jacobian-based Saliency Map Attack (JSMA) typically yielding the strongest misclassification while perturbations remain comparatively small; black-box methods like BoundaryAttack can also cause severe accuracy degradation in several cases. The study provides a baseline for defense development, attack design, and deeper analysis of language and gender differences in SER, offering actionable insights for improving the resilience of SER systems in real-world deployments.

Abstract

Speech emotion recognition (SER) has gained increasing attention in recent years, thanks to its potential applications in diverse fields and to the possibilities offered by deep learning technologies. However, recent studies have shown that deep learning models can be vulnerable to adversarial attacks. In this paper, we systematically assess this problem by examining the impact of various white-box and black-box adversarial attacks across different languages and genders within the context of SER. We first propose a suitable methodology for audio data processing, feature extraction, and a CNN-LSTM architecture. The observed outcomes highlight the significant vulnerability of CNN-LSTM models to adversarial examples (AEs): all the considered adversarial attacks significantly reduce the performance of the constructed models. Furthermore, when assessing the efficacy of the attacks, only minor differences were noted between the languages analyzed as well as between male and female speech. In summary, this work contributes to the understanding of the robustness of CNN-LSTM models, particularly in SER scenarios, and of the impact of AEs. Our findings serve as a baseline for a) developing more robust algorithms for SER, b) designing more effective attacks, c) investigating possible defenses, d) improving the understanding of the vocal differences between different languages and genders, and e) overall, enhancing our comprehension of the SER task.

Paper Structure (47 sections, 5 equations, 14 figures, 48 tables)

Figures (14)

  • Figure 1: Flowchart of the proposed methodology to conduct the experiment.
  • Figure 2: Example of the split-and-repeat process on log-Mel spectrograms: the original log-Mel spectrogram (a), the sliced segments (b) and (c), and segment (c) repeated to 3 seconds (d).
  • Figure 3: Architecture of the optimized CNN-LSTM model.
  • Figure 4: Time (s) required to generate the AEs for the various attacks and datasets for the best-performing configuration. Additional data can be found in Section \ref{resultsfinal}.
  • Figure 5: Accuracy obtained by the most effective configuration of each attack. Additional data can be found in Section \ref{resultsfinal}.
  • ...and 9 more figures
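
The split-and-repeat step described in the caption of Figure 2 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `split_and_repeat`, the `frames_per_segment` parameter, and the choice to tile a short trailing segment from its own start are assumptions made for illustration only. The number of frames that corresponds to 3 seconds depends on the sampling rate and hop length, which this excerpt does not state.

```python
import numpy as np


def split_and_repeat(spec: np.ndarray, frames_per_segment: int) -> np.ndarray:
    """Slice a (n_mels, n_frames) spectrogram into fixed-length segments.

    Hypothetical helper: a trailing segment shorter than
    `frames_per_segment` is repeated along the time axis and then
    truncated, so every returned segment has the same length.
    """
    if spec.ndim != 2:
        raise ValueError("expected a 2-D array of shape (n_mels, n_frames)")
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    n_frames = spec.shape[1]
    if n_frames == 0:
        raise ValueError("spectrogram has no frames")

    segments = []
    for start in range(0, n_frames, frames_per_segment):
        seg = spec[:, start:start + frames_per_segment]
        if seg.shape[1] < frames_per_segment:
            # Ceil division: how many copies are needed to cover the target length.
            reps = -(-frames_per_segment // seg.shape[1])
            seg = np.tile(seg, (1, reps))[:, :frames_per_segment]
        segments.append(seg)

    # Result has shape (n_segments, n_mels, frames_per_segment).
    return np.stack(segments)


# Toy input: 2 "mel bands" and 5 time frames, split into 3-frame segments.
toy_spec = np.arange(10).reshape(2, 5)
toy_segments = split_and_repeat(toy_spec, frames_per_segment=3)
```

With the toy input, the first segment holds frames 0 to 2 unchanged, while the second holds frames 3 and 4 followed by a repeat of frame 3, mirroring panels (b) to (d) of Figure 2 at a much smaller scale.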