Graph-based multi-Feature fusion method for speech emotion recognition

Xueyu Liu, Jie Lin, Chao Wang

TL;DR

This work tackles cross-corpus speech emotion recognition by learning a graph-based fusion of heterogeneous speech features. It introduces a three-module pipeline (Audio Feature Generation, Audio-Feature Multi-dimensional Edge Feature, and Speech Emotion Recognition) and uses five feature types (eGeMAPs, MFCCs, BoAW-e, BoAW-m, Deep Spectrum) with multi-dimensional edge features to encode pairwise feature relationships. A task-specific graph topology is learned via a GRATIS-inspired approach with cross-attention guiding edge updates, and a GRU/GCN backbone performs final $CCC$-based emotion prediction. Experiments on SEWA (German and Hungarian subsets) show significant improvements over baselines and ablations validate the benefits of TTf and AMEF; limitations include single-dataset focus and potential cross-cultural linguistic gaps, with future work toward broader datasets and modalities.

Abstract

Finding a proper way to fuse multiple speech features for cross-corpus speech emotion recognition is crucial, as different speech features can provide complementary cues reflecting human emotional state. Most previous approaches extract only a single speech feature for emotion recognition, and existing fusion methods such as concatenation, parallel connection, and splicing ignore the heterogeneous patterns in the interactions between features, which limits the performance of existing systems. In this paper, we propose a novel graph-based fusion method to explicitly model the relationships between every pair of speech features. Specifically, we propose a multi-dimensional edge feature learning strategy called Graph-based multi-Feature fusion method for speech emotion recognition. It represents each speech feature as a node and learns multi-dimensional edge features to explicitly describe the relationship between each feature-feature pair in the context of emotion recognition. In this way, the learned multi-dimensional edge features encode speech feature-level information from both the vertex and edge dimensions. Our approach consists of three modules: an Audio Feature Generation (AFG) module, an Audio-Feature Multi-dimensional Edge Feature (AMEF) module, and a Speech Emotion Recognition (SER) module. The proposed method yielded satisfactory results on the SEWA dataset and outperformed the baseline of the AVEC 2019 Workshop and Challenge. We used data from two cultures in the SEWA dataset, German and Hungarian, as our training and validation sets; the CCC scores for German improved by 17.28% for arousal and 7.93% for liking. Our method also shows a 13% improvement over alternative fusion techniques, including those that use a one-dimensional edge-based feature fusion approach.
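The results above are reported as CCC scores. As a minimal sketch, assuming CCC here denotes the standard concordance correlation coefficient, $\mathrm{CCC} = \frac{2\sigma_{xy}}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$, it could be computed as follows. The function name `ccc` and the choice of population (divide-by-n) statistics are illustrative assumptions, not taken from the paper.

```python
def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences.

    Illustrative sketch: uses population (divide-by-n) variance and covariance.
    """
    n = len(pred)
    if n == 0 or n != len(gold):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    # Population variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are constant and identical; CCC is undefined.
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2 * cov / denom


if __name__ == "__main__":
    # A constant offset between prediction and label lowers CCC even though
    # the Pearson correlation of these two sequences is 1.
    print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale between predictions and labels, so a model whose predictions are perfectly correlated with the labels but shifted still scores below 1.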

Paper Structure (21 sections, 15 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison of speech feature fusion methods, illustrated with audio features (AF), predicted recognition results, and feature vectors V. (I) Feature-level fusion by concatenation; (II) decision-level fusion based on weighted recognition results; neither (I) nor (II) uses edge features. (III) Our method encodes a unique association pattern for each feature vector in a node; these features also determine the task-specific topology of the graph, and each edge additionally carries multi-dimensional features describing the relationship between a pair of AFs.
  • Figure 2: The framework of the graph-based multi-feature fusion method. For simplicity, only the processing of a single audio file is shown. The details of each module are given in the corresponding sections.
  • Figure 3: The vertical axis shows the average CCC score over the three emotional dimensions: arousal, valence, and liking. The horizontal axis shows the ablation models and our method.
  • Figure 4: Ablation experiments on the DE+HU cultural dataset with t-SNE.