MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition

Haiyang Sun; Fulin Zhang; Yingying Gao; Zheng Lian; Shilei Zhang; Junlan Feng

TL;DR

The paper proposes the Multi-perspective Fusion Search Network (MFSN), a novel framework for pre-training knowledge in Speech Emotion Recognition (SER). MFSN partitions speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and introduces a new architecture search space to fully leverage them.

Abstract

Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). For comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. For appropriateness, we verify the efficacy of different modeling approaches in capturing SEC, filling a gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.

Paper Structure (16 sections, 6 equations, 5 figures, 5 tables)

Introduction
Multiple-perspective Fusion Search Network
TEC & SEC
Search Space
Search Algorithm
Different modeling methods
Datasets
Experimental Setup & Results
Main setup
Comparison with existing works
Comparison between different modeling methods
Unified model configuration
Results
Comparison with different configurations
Visualization of Search Results
...and 1 more section

Figures (5)

Figure 1: The overall framework of MFSN.
Figure 2: The search space we designed is divided into two parts: Choice Cell and Fusion Cell.
Figure 3: Unified training framework for different modeling.
Figure 4: The visualization of adjustment strategy search results for the Leave-one-session strategy. Here, $1$, $m$, and $k$ represent three levels of features. Red color indicates the best path.
Figure 5: Visualization of confusion matrices.
