TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition

Chengxin Chen, Pengyuan Zhang

TL;DR

The paper tackles noise-robust speech emotion recognition by integrating a fixed, pre-trained speech enhancement front-end with a two-level refinement framework. TRNet uses an SNR-aware compensation to adaptively blend the original and enhanced spectrograms and a FiLM-like bridge to calibrate high-level emotion representations, guided by both low-level and high-level losses with $\nabla$-balanced weights. Experiments on IEMOCAP with ESC-50 and MUSAN demonstrate that TRNet improves robustness in matched and unmatched noisy environments without sacrificing clean performance, outperforming several baselines and showing the importance of both spectral refinement and representation calibration. This approach offers a practical, modular pathway to deploy noise-resilient SER systems in real-world settings, with quantified gains in UAR/WAR and insights from ablations and visualizations.

Abstract

One persistent challenge in Speech Emotion Recognition (SER) is the ubiquitous environmental noise, which frequently results in deteriorating SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. Later, we utilize clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of enhanced speech during model training. Experimental results validate that the proposed TRNet substantially promotes the robustness of the proposed system in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
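
The two refinement levels named in the TL;DR can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the convex-blend form, and the convention that the coefficient `alpha` weights the original (noisy) spectrogram are all assumptions made for illustration, and the FiLM step is shown only in its generic scale-and-shift form.

```python
import numpy as np


def snr_aware_blend(noisy_spec, enhanced_spec, alpha):
    """Blend the original and enhanced spectrograms (hypothetical form).

    alpha is an assumed SNR-derived coefficient in [0, 1]; here a larger
    alpha keeps more of the original spectrogram, which would suit
    high-SNR inputs where enhancement mostly adds distortion.
    """
    alpha = np.clip(alpha, 0.0, 1.0)
    return alpha * noisy_spec + (1.0 - alpha) * enhanced_spec


def film_modulate(features, gamma, beta):
    """Generic FiLM-style calibration: per-channel scale and shift.

    features has shape (time, channels); gamma and beta have shape
    (channels,) and broadcast over the time axis.
    """
    return gamma * features + beta


# Toy inputs standing in for real spectrograms and representations.
noisy = np.ones((3, 4))
enhanced = np.zeros((3, 4))
blended = snr_aware_blend(noisy, enhanced, alpha=0.25)

features = np.array([[1.0, 2.0], [3.0, 4.0]])
calibrated = film_modulate(features, gamma=np.array([2.0, 0.5]), beta=np.array([0.0, 1.0]))
```

In a trained system the coefficient and the FiLM parameters would presumably be predicted by the network rather than supplied by hand, as they are here.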

Paper Structure (19 sections, 7 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Diagram of our proposed TRNet. The red arrows are activated during both the training and inference phases, while the black arrows are activated only in the training phase.
  • Figure 2: Estimated SNR coefficients in different noisy environments. (a) IEM-ESC; (b) IEM-MUSAN.
  • Figure 3: t-SNE visualization of deep emotion representations. 500 clean utterances were randomly selected and contaminated by noise samples from ESC-50 at different SNRs (20 dB, 10 dB, and 0 dB). The smaller semi-transparent points represent individual samples, while the larger opaque ones correspond to the centers of the different distributions. (a) Baseline-c; (b) Baseline-e; (c) TRNet.