Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge
Rui Liu, Zening Ma
TL;DR
This work introduces EMS, an emotional masking strategy that injects frame-level emotion intensity into self-supervised speech representation learning. The method couples an intensity extractor (StrengthNet) with a linear alignment step and applies intensity-guided masking to two SSL backbones, Mockingjay and NPC. The resulting emotion-aware representations improve speech emotion recognition (SER) on IEMOCAP and generalize to phoneme recognition and intent classification. Guiding the masking toward higher-intensity frames yields significant SER gains (up to ~7.1 percentage points) and better downstream performance. The approach offers a practical path toward more emotionally informed speech models with broader applicability across SLP tasks.
Abstract
Speech Self-Supervised Learning (SSL) has demonstrated considerable efficacy in various downstream tasks. Nevertheless, prevailing self-supervised models often overlook emotion-related prior information, and so miss the improvement that emotion prior knowledge in speech could bring to emotion-related tasks. In this paper, we propose an emotion-aware speech representation learning method with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. We then propose a novel emotional masking strategy (EMS) that incorporates these emotion intensities into the masking process. We select two representative models, the Transformer-based Mockingjay and the CNN-based Non-autoregressive Predictive Coding (NPC), and conduct experiments on the IEMOCAP dataset. The experiments demonstrate that the representations derived from our proposed method outperform those of the original models on the speech emotion recognition (SER) task.
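To make the masking idea concrete, the following is a minimal sketch of one possible reading of intensity-guided masking; it is not the authors' implementation. It assumes that the intensity curve is resampled to the number of feature frames by linear interpolation, that the highest-intensity frames are the ones selected, and that masked frames are simply zeroed. The function names (`align_intensity`, `emotional_mask`, `apply_mask`) and the `mask_ratio` default are hypothetical and chosen only for illustration.

```python
from typing import List


def align_intensity(intensity: List[float], n_frames: int) -> List[float]:
    """Resample an intensity curve to n_frames points by linear interpolation.

    Assumption: this stands in for the alignment step, whose exact form is
    not specified in the text above.
    """
    if n_frames <= 0 or not intensity:
        return []
    if len(intensity) == 1 or n_frames == 1:
        return [intensity[0]] * n_frames
    scale = (len(intensity) - 1) / (n_frames - 1)
    aligned = []
    for i in range(n_frames):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(intensity) - 1)
        frac = pos - lo
        aligned.append(intensity[lo] * (1 - frac) + intensity[hi] * frac)
    return aligned


def emotional_mask(intensity: List[float], mask_ratio: float = 0.15) -> List[bool]:
    """Return a per-frame mask (True = masked) covering the highest-intensity frames.

    Assumption: a deterministic top-k selection; the actual strategy may
    sample frames or mask contiguous spans instead.
    """
    n = len(intensity)
    if n == 0:
        return []
    n_mask = min(n, max(1, round(n * mask_ratio)))
    ranked = sorted(range(n), key=lambda i: intensity[i], reverse=True)
    chosen = set(ranked[:n_mask])
    return [i in chosen for i in range(n)]


def apply_mask(frames: List[List[float]], mask: List[bool]) -> List[List[float]]:
    """Zero out masked frames; unmasked frames are copied unchanged."""
    return [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]


if __name__ == "__main__":
    # Toy input: 4 feature frames of dimension 2 and a 2-point intensity curve.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    intensity = align_intensity([0.0, 1.0], n_frames=len(frames))
    mask = emotional_mask(intensity, mask_ratio=0.5)
    print(mask)
    print(apply_mask(frames, mask))
```

In a pretraining loop, the masked frames would be fed to the SSL backbone and the model trained to reconstruct the original frames, so that the reconstruction objective concentrates on the emotionally salient regions.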
