A Comprehensive Study on the Effectiveness of ASR Representations for Noise-Robust Speech Emotion Recognition

Xiaohan Shi, Jiajun He, Xingfeng Li, Tomoki Toda

TL;DR

This work tackles noise-robust speech emotion recognition (NSER) by leveraging large-scale automatic speech recognition (ASR) representations as noise-robust features. It introduces an ASR-based NSER framework with an embedding module and a Layer Adapter that fuses multi-layer encoder and decoder features, exploiting both acoustic-phonetic and semantic cues. Across MELD, IEMOCAP, and cross-lingual CASIA variants, the approach consistently outperforms traditional denoising, high-level features, and self-supervised learning baselines, and in several settings even surpasses text-based approaches that use transcripts. The study also characterizes layer-wise contributions, robustness across noise intensities, and cross-lingual generalization, providing practical insights for deploying ASR-enhanced NSER in real-world environments.

Abstract

This paper proposes an efficient approach to noisy speech emotion recognition (NSER). Conventional NSER approaches have proven effective in mitigating the impact of artificial noise sources, such as white Gaussian noise, but they struggle with non-stationary noises in real-world environments because of the complexity and uncertainty of such noises. To overcome this limitation, we introduce a new method for NSER that adopts an automatic speech recognition (ASR) model as a noise-robust feature extractor to eliminate non-vocal information in noisy speech. We first obtain intermediate layer information from the ASR model as a feature representation for emotional speech and then apply this representation to the downstream NSER task. Our experimental results show that the proposed method 1) achieves better NSER performance than the conventional noise reduction method, 2) outperforms self-supervised learning approaches, and 3) even outperforms text-based approaches that use ASR transcriptions or ground-truth transcriptions of noisy speech.
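
The idea of using intermediate ASR layers as a feature representation can be illustrated with a minimal sketch. It is not the authors' Layer Adapter, whose structure is not given here; it swaps in a generic softmax-weighted sum over layers followed by mean pooling over time. The function names `softmax` and `fuse_layers`, the array shapes, and the random stand-in for ASR hidden states are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def fuse_layers(hidden_states, layer_logits):
    """Fuse per-layer features into one utterance-level vector.

    hidden_states: array of shape (num_layers, num_frames, dim), assumed to
        hold the outputs of several intermediate ASR layers for one utterance.
    layer_logits: array of shape (num_layers,), one score per layer
        (hypothetical; in practice these would be learned with the classifier).
    """
    weights = softmax(layer_logits)                  # (num_layers,)
    # Weighted sum over the layer axis -> (num_frames, dim)
    fused = np.tensordot(weights, hidden_states, axes=(0, 0))
    # Mean-pool over time -> (dim,)
    return fused.mean(axis=0), weights


# Random stand-in for ASR layer outputs: 4 layers, 50 frames, 8 dimensions.
rng = np.random.default_rng(0)
hidden_states = rng.standard_normal((4, 50, 8))
layer_logits = np.zeros(4)  # equal scores give uniform layer weights

utterance_vec, weights = fuse_layers(hidden_states, layer_logits)
print(utterance_vec.shape, weights)
```

With equal scores every layer contributes equally, so the result is the plain average over layers and frames. An emotion classifier would take `utterance_vec` as input, and training the layer scores would let it favor the layers that carry the most useful cues.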

Paper Structure (23 sections, 9 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Structure of the NSER via ASR representations framework, where (a) represents the overall architecture, (b) details the structure of the L-adapter, and (c) outlines the structure of emotion recognition classification.
  • Figure 2: Distribution of emotion classes in MELD, IEMOCAP, and CASIA.
  • Figure 3: Layer-wise performance (F1) of ASR (Whisper) representations.
  • Figure 4: Layer-wise (F1) heatmaps of SSL models.
  • Figure 5: Class-wise robustness analysis under varying SNR conditions. Abbreviations of class labels are as follows: N (Neutral), H (Happy), A (Angry), and S (Sad).
  • ...and 1 more figures