Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge

Rui Liu; Zening Ma

TL;DR

This work introduces EMS, an emotional masking strategy that injects frame-level emotion intensity into self-supervised speech representation learning. Frame-level intensities come from an intensity extractor (StrengthNet) followed by a linear alignment step, and they guide the masking in two SSL backbones, Mockingjay and NPC. The resulting emotion-aware representations improve speech emotion recognition (SER) on IEMOCAP, with gains of up to ~7.1 percentage points, and also generalize to phoneme recognition and intent classification. The results indicate that letting higher-intensity frames drive masking benefits representation learning, offering a practical path toward emotionally informed speech models that apply across SLP tasks.

Abstract

Speech self-supervised learning (SSL) has proven effective across a range of downstream tasks. However, prevailing self-supervised models rarely incorporate emotion-related prior information, and so miss the chance to improve emotion-related tasks through emotion prior knowledge in speech. In this paper, we propose an emotion-aware speech representation learning method with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. We then propose a novel emotional masking strategy (EMS) that incorporates these intensities into the masking process. We select two representative models, the Transformer-based Mockingjay and the CNN-based Non-autoregressive Predictive Coding (NPC), and conduct experiments on the IEMOCAP dataset. The experiments show that representations derived from the proposed method outperform those of the original models on the speech emotion recognition (SER) task.
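
The idea of letting emotion intensity guide the masking can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `emotion_guided_mask`, the 15% mask ratio, the sampling rule (mask positions drawn with probability proportional to intensity), and zeroing out masked frames are all assumptions made for illustration. The exact EMS rule is defined in the paper's Pre-training Task section.

```python
import numpy as np


def emotion_guided_mask(features, intensity, mask_ratio=0.15, rng=None):
    """Mask frames, favouring those with higher emotion intensity.

    Illustrative assumption, not the paper's exact rule: frames are sampled
    without replacement with probability proportional to their intensity.

    features:  (T, D) array of acoustic features.
    intensity: (T,) array of non-negative frame-level intensity scores.
    Returns (masked_features, mask), where mask[t] is True for masked frames.
    """
    rng = np.random.default_rng() if rng is None else rng
    features = np.asarray(features, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    if features.shape[0] != intensity.shape[0]:
        raise ValueError("features and intensity must have the same number of frames")

    num_frames = features.shape[0]
    num_masked = max(1, int(round(mask_ratio * num_frames)))

    # A small epsilon keeps every frame selectable, even at zero intensity.
    weights = intensity + 1e-8
    probs = weights / weights.sum()

    chosen = rng.choice(num_frames, size=num_masked, replace=False, p=probs)
    mask = np.zeros(num_frames, dtype=bool)
    mask[chosen] = True

    masked = features.copy()
    masked[mask] = 0.0  # assumed masking value; the input array is left untouched
    return masked, mask
```

With such a helper, a pre-training step would feed `masked` to the encoder and compute the reconstruction loss on the positions where `mask` is True, so that frames with stronger emotion are reconstructed more often than under uniform random masking.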

Paper Structure

This paper contains 16 sections, 3 equations, 1 figure, 3 tables.

Introduction
Proposed Method
Task Definition and Overall workflow
Knowledge Acquisition and Pre-training Task
Knowledge Acquisition
Pre-training Task
The pre-training model
Mockingjay with EMS
NPC with EMS
Experiments and Results
Dataset
Experimental Setup
Results on the SER Task
Analysis on Generalization Ability
Conclusion
...and 1 more section

Figures (1)

Figure 1: Proposed model architecture. Small rectangles denote frame-level emotional intensity scores or acoustic features: rectangles containing numbers are emotional intensity scores, and white rectangles are masked frames. The self-supervised model is the encoder part of the improved model.
