Emotion-Aware Speech Self-Supervised Representation Learning with Intensity Knowledge

Rui Liu, Zening Ma

TL;DR

This work introduces EMS, an emotional masking strategy that injects frame-level emotion intensity into self-supervised speech representation learning. Frame-level intensities are obtained from an intensity extractor (StrengthNet) coupled with a linear alignment step and are used to guide masking in two SSL backbones, Mockingjay and NPC. The resulting emotion-aware representations improve speech emotion recognition (SER) on IEMOCAP and generalize to phoneme recognition and intent classification. Letting high-intensity frames guide masking yields SER gains of up to about 7.1 percentage points, along with better downstream performance. The approach offers a practical path toward emotion-informed speech models with broader applicability across SLP tasks.

Abstract

Speech self-supervised learning (SSL) has proven effective across a range of downstream tasks. However, prevailing self-supervised models rarely incorporate emotion-related prior information, and so miss the improvement that emotion prior knowledge in speech could bring to emotion-related tasks. In this paper, we propose an emotion-aware speech representation learning method with intensity knowledge. Specifically, we extract frame-level emotion intensities using an established speech-emotion understanding model. We then propose a novel emotional masking strategy (EMS) that incorporates these intensities into the masking process. We select two representative models, the Transformer-based Mockingjay and the CNN-based Non-autoregressive Predictive Coding (NPC), and conduct experiments on the IEMOCAP dataset. The experiments show that representations derived from our method outperform those of the original models on the speech emotion recognition (SER) task.
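
The masking idea described above can be illustrated with a minimal sketch. The summary does not specify how intensities are turned into a mask, so everything here is an assumption made for illustration: the helper names `intensity_guided_mask` and `apply_mask`, the `mask_ratio` parameter, sampling frames with probability proportional to their intensity score, and zeroing out masked frames. It should not be read as the authors' EMS implementation.

```python
import numpy as np


def intensity_guided_mask(intensity, mask_ratio=0.15, rng=None):
    """Return a boolean mask (True = masked) that favours high-intensity frames.

    Hypothetical helper: frames are drawn without replacement, with
    probability proportional to their (shifted) intensity score.
    """
    rng = np.random.default_rng() if rng is None else rng
    intensity = np.asarray(intensity, dtype=float)
    n_frames = intensity.shape[0]
    n_mask = max(1, int(round(mask_ratio * n_frames)))
    # Shift to non-negative and add a small floor so every frame stays selectable.
    weights = intensity - intensity.min() + 1e-6
    probs = weights / weights.sum()
    chosen = rng.choice(n_frames, size=n_mask, replace=False, p=probs)
    mask = np.zeros(n_frames, dtype=bool)
    mask[chosen] = True
    return mask


def apply_mask(features, mask):
    """Zero out masked frames (assumed masking operation for this sketch)."""
    masked = features.copy()
    masked[mask] = 0.0
    return masked


# Toy example: 8 frames of 4-dimensional acoustic features.
rng = np.random.default_rng(0)
intensity = np.array([0.1, 0.2, 0.9, 0.8, 0.1, 0.05, 0.7, 0.3])
features = rng.normal(size=(8, 4))
mask = intensity_guided_mask(intensity, mask_ratio=0.25, rng=rng)
masked_features = apply_mask(features, mask)
```

In a masked-reconstruction setup, the model would then be trained to recover the original values of `features` at the masked positions from `masked_features`, so frames with stronger emotion are reconstructed more often than under uniform random masking.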

Paper Structure

This paper contains 16 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Proposed model architecture. Small rectangles denote frame-level emotional intensity scores or acoustic features: rectangles containing numbers are intensity scores, and white rectangles are masked frames. The self-supervised model shown is the encoder of the improved model.