GMP-TL: Gender-augmented Multi-scale Pseudo-label Enhanced Transfer Learning for Speech Emotion Recognition

Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao

TL;DR

Speech emotion recognition usually relies on utterance-level labels alone, which miss how emotion varies within an utterance. The paper addresses this with GMP-TL, a framework that derives gender-augmented multi-scale frame-level pseudo-labels (GMPs) from a pre-trained HuBERT model. A multi-task, multi-scale GMP extraction phase is followed by two-stage fine-tuning: first a GMP-guided stage trained with cross-entropy (CE) loss, then a stage trained with AM-Softmax loss on utterance-level labels. On IEMOCAP the model reaches a WAR of 80.0% and a UAR of 82.0%. Key contributions include frame-level GMP generation via multi-scale k-means clustering, the combination of frame- and utterance-level supervision, and ablation results indicating that intermediate HuBERT layers are beneficial for GMPs. The approach narrows the performance gap between unimodal and multimodal SER, reducing the reliance on multimodal data.
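For readers unfamiliar with the AM-Softmax loss named above, the following is a minimal NumPy sketch of its standard additive-margin form. It is an illustration only, not the authors' implementation: the function name, the tensor shapes, and the scale `s=30.0` and margin `m=0.35` are assumptions (commonly used values, not taken from this paper).

```python
import numpy as np

def am_softmax_loss(embeddings, weights, labels, s=30.0, m=0.35):
    """Additive-margin softmax loss (generic form; s and m are assumed values)."""
    # L2-normalise embeddings (rows) and class weights (columns) so that
    # their dot products are cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos = x @ w  # shape: (batch, num_classes)

    # Subtract the margin m from the target-class cosine only.
    margin = np.zeros_like(cos)
    margin[np.arange(len(labels)), labels] = m
    logits = s * (cos - margin)

    # Numerically stable log-softmax followed by the mean negative log-likelihood.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# Toy data: 5 utterance embeddings of size 16, 4 emotion classes (hypothetical sizes).
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 16))
W = rng.normal(size=(16, 4))
y = np.array([0, 1, 2, 3, 0])

loss_with_margin = am_softmax_loss(emb, W, y, m=0.35)
loss_without_margin = am_softmax_loss(emb, W, y, m=0.0)
```

With `m=0.0` the function reduces to a scaled cosine softmax; a positive margin lowers the target-class logit, so the loss is larger for the same inputs and the model must separate classes by at least that margin to reduce it.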

Abstract

The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, current research typically relies on utterance-level emotion labels, inadequately capturing the complexity of emotions within a single utterance. In this paper, we introduce GMP-TL, a novel SER framework that employs gender-augmented multi-scale pseudo-label (GMP) based transfer learning to mitigate this gap. Specifically, GMP-TL initially uses the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level GMPs. Subsequently, to fully leverage frame-level GMPs and utterance-level emotion labels, a two-stage model fine-tuning approach is presented to further optimize GMP-TL. Experiments on IEMOCAP show that our GMP-TL attains a WAR of 80.0% and a UAR of 82.0%, achieving superior performance compared to state-of-the-art unimodal SER methods while also yielding comparable results to multimodal SER approaches.
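The multi-scale k-means step mentioned in the abstract can be pictured with a small sketch: cluster the same frame-level features at several cluster counts and keep one pseudo-label sequence per scale. This is a simplified illustration under stated assumptions, not the paper's pipeline. Random vectors stand in for HuBERT frame features, the cluster counts `(4, 16)` are hypothetical, and the gender augmentation and multi-task fine-tuning are not modelled here.

```python
import numpy as np

def kmeans_labels(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per frame."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames (fancy indexing copies).
    centroids = features[rng.choice(len(features), size=k, replace=False)]

    def assign(cents):
        # Squared Euclidean distance from every frame to every centroid.
        dists = ((features[:, None, :] - cents[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment against the last centroid update.
    return assign(centroids)

def multi_scale_pseudo_labels(features, scales=(4, 16)):
    """One frame-level label sequence per cluster count (scales are assumed)."""
    return {k: kmeans_labels(features, k) for k in scales}

# Stand-in for frame-level features: 200 frames x 8 dims (hypothetical shape).
frames = np.random.default_rng(0).normal(size=(200, 8))
pseudo = multi_scale_pseudo_labels(frames)
```

Each entry of `pseudo` assigns every frame a cluster id at a different granularity. A small cluster count gives coarse labels and a larger one gives finer labels, which is the sense of "multi-scale" illustrated here.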

Paper Structure (18 sections, 3 equations, 1 figure, 3 tables)

This paper contains 18 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Overview of the proposed GMP-TL framework. BiLSTM is a bi-directional LSTM, MAP denotes mean average pooling, LinearP denotes the linear projection module, and Emo. and Gend. abbreviate emotion and gender.