Describe Where You Are: Improving Noise-Robustness for Speech Emotion Recognition with Text Description of the Environment

Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso

TL;DR

This paper tackles the problem of speech emotion recognition under real-world noisy conditions by introducing TG-EAT, a text-guided environment-aware training framework that conditions a transformer-based SER model on text descriptions of the testing environment. It leverages two text-embedding paths (CLAP-based and RoBERTa-based) to fuse environment information with acoustic representations, enabling zero-shot adaptation to unseen noises. Empirical results on MSP-Podcast with Freesound and DEMAND noise show that LLM-based environment embeddings substantially improve noise robustness, with large gains at -5 dB SNR and notable cross-corpus generalization, especially when fine-tuning the text encoder. The work demonstrates that explicit environment descriptions can be a scalable, single-model solution for multi-noise SER deployment and suggests future directions for dynamic fusion and automatic environmental captioning.

Abstract

Speech emotion recognition (SER) systems often struggle in real-world environments, where ambient noise severely degrades their performance. This paper explores a novel approach that exploits prior knowledge of the testing environment to maximize SER performance under noisy conditions. To this end, we propose text-guided, environment-aware training, in which an SER model is trained on contaminated speech samples paired with descriptions of their noise. We use a pre-trained text encoder to extract a text-based environment embedding and fuse it with a transformer-based SER model during training and inference. We demonstrate the effectiveness of our approach through experiments with the MSP-Podcast corpus and real-world additive noise samples collected from the Freesound and DEMAND repositories. Our experiments indicate that text-based environment descriptions processed by a large language model (LLM) produce representations that improve the noise robustness of the SER system. With a contrastive learning (CL)-based representation, the proposed method can be further improved by jointly fine-tuning the text encoder with the emotion recognition model. At a signal-to-noise ratio (SNR) of -5 dB, fine-tuning the text encoder improves our CL-based representation method by 76.4% (arousal), 100.0% (dominance), and 27.7% (valence).
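
To make the "contaminated speech" and "-5 dB SNR" conditions concrete, here is a minimal sketch of mixing a noise recording into a speech signal at a target SNR. It uses the textbook definition of SNR (ratio of average signal power to average noise power, in decibels). The abstract does not describe the authors' exact mixing procedure, so the function name `mix_at_snr`, the choice to loop short noise clips, and the synthetic signals are all assumptions made for illustration.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio equals snr_db.

    Hypothetical helper: the noise is looped or trimmed to the speech length
    (an assumption, not a detail taken from the paper).
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise


# Synthetic stand-ins for a 1-second utterance and a shorter noise clip.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(8000)

noisy, scaled_noise = mix_at_snr(speech, noise, snr_db=-5.0)

# Verify the achieved SNR from the two components.
measured_snr_db = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
print(round(measured_snr_db, 6))
```

At -5 dB the noise carries roughly three times the power of the speech, which is why this condition is the hardest one reported.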

Paper Structure (19 sections, 6 figures, 9 tables)

This paper contains 19 sections, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Our proposed text-guided environment-aware training framework. The environment representation is concatenated with the output of the convolutional feature encoder.
  • Figure 2: Embedding differences in the first and the last transformer encoder layers between clean and noisy speech in the -5 dB condition. We use the wavlm-base-plus feature vector in this analysis. (a) illustrates the mean squared error (MSE) between the clean and noisy representations, where both representations are extracted from each of the final models. (b) illustrates the MSE between the clean representation extracted from the Original model and the noisy representation extracted from each final model.
  • Figure 3: Visualization of text-based environment embeddings. We use UMAP to project text embeddings into 2D space.
  • Figure 4: Comparison of the text embedding projection obtained before fine-tuning (TG-EAT-CL) and after fine-tuning (TG-EAT-CL-FT). Similar to the plots in Figure 3, we use UMAP to project text embeddings into a 2D space.
  • Figure 5: Relative improvement of fine-tuning the text encoder (TG-EAT-CL-FT) in the TG-EAT-CL framework under the -5 dB condition. We illustrate 16 environments, including the eight highest and the eight lowest improvements for each attribute.
  • ...and 1 more figures
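
The Figure 1 caption says the environment representation is concatenated with the output of the convolutional feature encoder. The sketch below shows one way such a fusion could look, using NumPy arrays in place of real model tensors. The caption does not say along which axis the concatenation happens or whether a projection is applied first, so prepending a projected embedding as an extra frame is an assumption, as are the function name `fuse_environment`, the random projection matrix, and all dimensions.

```python
import numpy as np


def fuse_environment(frame_features, env_embedding, projection):
    """Prepend a projected environment embedding to a sequence of frame features.

    frame_features: (T, D) stand-in for the convolutional feature encoder output
    env_embedding:  (E,)   stand-in for the text-based environment embedding
    projection:     (E, D) hypothetical linear map into the frame feature space
    """
    env_token = env_embedding @ projection  # shape (D,)
    # Concatenate along the time axis so later layers can attend to the
    # environment token (an assumed design, not confirmed by the caption).
    return np.concatenate([env_token[None, :], frame_features], axis=0)


# Illustrative sizes only: T frames, D feature dims, E text-embedding dims.
rng = np.random.default_rng(0)
T, D, E = 49, 768, 512
frames = rng.standard_normal((T, D))
env = rng.standard_normal(E)
W = rng.standard_normal((E, D)) / np.sqrt(E)

fused = fuse_environment(frames, env, W)
print(fused.shape)  # (50, 768)
```

Because the same fused sequence is built at training and inference time, changing the text description is enough to condition a single model on a different environment, which is the property the TL;DR describes as zero-shot adaptation to unseen noises.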