MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition

Haiyang Sun, Fulin Zhang, Yingying Gao, Zheng Lian, Shilei Zhang, Junlan Feng

TL;DR

A novel framework for pre-training knowledge in SER is proposed, called Multi-perspective Fusion Search Network (MFSN). It partitions speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC) to capture cues from both semantic and acoustic perspectives, and introduces a new architecture search space to fully leverage them.

Abstract

Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. Considering appropriateness, we verify the efficacy of different modeling approaches in capturing SEC, filling a gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.

Paper Structure (16 sections, 6 equations, 5 figures, 5 tables)

This paper contains 16 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The overall framework of MFSN.
  • Figure 2: The search space we designed is divided into two parts: Choice Cell and Fusion Cell.
  • Figure 3: Unified training framework for different modeling approaches.
  • Figure 4: The visualization of adjustment strategy search results for the Leave-one-session strategy. Here, $1$, $m$, and $k$ represent three levels of features. Red color indicates the best path.
  • Figure 5: Visualization of confusion matrices.