TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition
Chengxin Chen, Pengyuan Zhang
TL;DR
The paper tackles noise-robust speech emotion recognition by combining a fixed, pre-trained speech enhancement front-end with a two-level refinement framework. TRNet uses an SNR-aware compensation step to adaptively blend the original and enhanced spectrograms, and a FiLM-like bridge to calibrate high-level emotion representations. Training is guided by both a low-level and a high-level loss, combined through balancing weights. Experiments on IEMOCAP with noise from ESC-50 and MUSAN show that TRNet improves robustness in matched and unmatched noisy environments without sacrificing clean-speech performance. It outperforms several baselines, and the ablations indicate that both spectral refinement and representation calibration contribute. The approach offers a practical, modular route to deploying noise-resilient SER systems in real-world settings, with gains reported in UAR/WAR and supporting evidence from ablations and visualizations.
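The two refinement ideas named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sigmoid mapping from SNR to a blend weight, its `midpoint_db` and `slope` parameters, and the function names are all assumptions made for illustration. It also assumes that a higher estimated SNR should put more weight on the original spectrogram, and that the FiLM-like bridge acts as a per-channel scale and shift.

```python
import numpy as np

def snr_to_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Hypothetical mapping from an estimated SNR (dB) to a weight in (0, 1).

    Assumption: high SNR -> weight near 1 (trust the original spectrogram),
    low SNR -> weight near 0 (trust the enhanced spectrogram).
    """
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))

def blend_spectrograms(noisy_spec, enhanced_spec, snr_db):
    """Convex combination of the original and enhanced spectrograms."""
    w = snr_to_weight(snr_db)
    return w * noisy_spec + (1.0 - w) * enhanced_spec

def film(features, gamma, beta):
    """FiLM-style modulation: per-channel scale (gamma) and shift (beta)."""
    return gamma * features + beta

# Toy inputs: 100 frames x 80 frequency bins, 100 frames x 16 feature channels.
rng = np.random.default_rng(0)
noisy = rng.random((100, 80))
enhanced = rng.random((100, 80))
refined = blend_spectrograms(noisy, enhanced, snr_db=5.0)

features = rng.standard_normal((100, 16))
gamma = np.ones(16)   # in a real model these would be predicted by a network
beta = np.zeros(16)
calibrated = film(features, gamma, beta)
```

In a trained system the blend weight and the FiLM parameters would be learned rather than fixed; the sketch only shows the shape of the two operations.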
Abstract
One persistent challenge in Speech Emotion Recognition (SER) is ubiquitous environmental noise, which frequently degrades SER performance in practice. In this paper, we introduce a Two-level Refinement Network, dubbed TRNet, to address this challenge. Specifically, a pre-trained speech enhancement module is employed for front-end noise reduction and noise level estimation. During model training, we then use clean speech spectrograms and their corresponding deep representations as reference signals to refine the spectrogram distortion and representation shift of the enhanced speech. Experimental results show that TRNet substantially improves the robustness of the system in both matched and unmatched noisy environments, without compromising its performance in noise-free environments.
