Two-stage Framework for Robust Speech Emotion Recognition Using Target Speaker Extraction in Human Speech Noise Conditions

Jinyi Mi, Xiaohan Shi, Ding Ma, Jiajun He, Takuya Fujimura, Tomoki Toda

TL;DR

This work tackles robust speech emotion recognition (SER) under human speech noise by proposing a two-stage framework that cascades target speaker extraction (TSE) with SER. The first stage pretrains a TSE model to isolate the target speaker from a mixture; the second stage either trains SER on the extracted speech or trains TSE and SER jointly, with base and fine-tuned variants. Results show an unweighted accuracy (UA) improvement of up to 14.33 percentage points over a baseline without TSE, and joint TSE-SER training further boosts performance, especially in different-gender mixtures. The approach offers practical robustness to human speech interference and points to future extensions to other attributes and to more challenging scenarios with multiple interfering speakers.

Abstract

Developing a robust speech emotion recognition (SER) system for noisy conditions is challenging because different noise types have different properties. Most previous studies have not considered the impact of human speech noise, which limits the application scope of SER. In this paper, we propose a novel two-stage framework for this problem that cascades a target speaker extraction (TSE) method with SER. In the first stage, we train a TSE model to extract the speech of the target speaker from a mixture. In the second stage, we use the extracted speech for SER training. Additionally, we explore joint training of the TSE and SER models in the second stage. Our system achieves a 14.33% improvement in unweighted accuracy (UA) compared to a baseline that does not use the TSE method, demonstrating the effectiveness of our framework in mitigating the impact of human speech noise. Moreover, we conduct experiments that consider speaker gender, showing that our framework performs particularly well on different-gender mixtures.
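
The headline result is reported in unweighted accuracy (UA). The text above does not define the metric, so the following is a minimal sketch that assumes the convention common in SER work: UA is the mean of the per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The function names and the emotion labels ("neu", "ang") are illustrative and do not come from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed definition of UA)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = defaultdict(int)  # correctly classified utterances per true class
    total = defaultdict(int)    # utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain utterance-level accuracy, shown only for contrast with UA."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical, imbalanced toy labels: 8 neutral and 2 angry utterances,
# with a classifier that always predicts "neu".
y_true = ["neu"] * 8 + ["ang"] * 2
y_pred = ["neu"] * 10
ua = unweighted_accuracy(y_true, y_pred)  # (1.0 + 0.0) / 2
wa = weighted_accuracy(y_true, y_pred)    # 8 / 10
```

In this toy case the always-neutral classifier scores 0.8 on plain accuracy but only 0.5 on UA, which is why UA is often preferred when emotion classes are imbalanced.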

Paper Structure (15 sections, 6 equations, 2 figures, 6 tables)