Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition

Ruoyu Zhao, Xiantao Jiang, F. Richard Yu, Victor C. M. Leung, Tao Wang, Shaohu Zhang

TL;DR

This work addresses cross-linguistic speech emotion recognition (CLSER) under data scarcity by proposing HuMP-CAT, a framework that fuses HuBERT-based representations with MFCC and prosodic features via a cross-attention transformer. The model is pretrained on the English IEMOCAP corpus and fine-tuned with limited target-language data, achieving an average accuracy of 78.75% across seven datasets in five languages, including 88.69% on EMODB (German) and 79.48% on EMOVO (Italian). The authors report that HuMP-CAT outperforms existing CLSER approaches, which supports combining multi-feature fusion with cross-attention-based integration for cross-language emotion understanding. The approach has practical implications for scalable, language-robust SER in multilingual human-computer interaction, and the authors propose expanding the source dataset in future work to further improve generalization.

Abstract

Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction. Cross-Linguistic SER (CLSER) remains a challenging research problem because linguistic and acoustic features vary significantly across languages. In this study, we propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics. These features are fused using a cross-attention transformer (CAT) mechanism during feature extraction. Transfer learning is applied to carry knowledge learned from a source emotional speech dataset over to the target corpus for emotion recognition. We use IEMOCAP as the source dataset to train the source model and evaluate the proposed method on seven datasets in five languages (English, German, Spanish, Italian, and Chinese). We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% across the seven datasets, with notable performance of 88.69% on EMODB (German) and 79.48% on EMOVO (Italian). Our extensive evaluation demonstrates that HuMP-CAT outperforms existing methods across multiple target languages.
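
To make the fusion idea in the abstract more concrete, the following is a minimal sketch of single-head scaled dot-product cross-attention between two feature streams. It is not the authors' implementation: the choice of HuBERT frames as queries and MFCC frames as keys/values, the frame counts, the feature sizes, the random projection matrices, and all function and variable names are assumptions made purely for illustration. The abstract does not specify these details.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head cross-attention: queries come from one feature stream,
    keys and values come from the other stream."""
    q = query_feats @ w_q                      # (T_q, d_model)
    k = context_feats @ w_k                    # (T_c, d_model)
    v = context_feats @ w_v                    # (T_c, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_q, T_c)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights


# Hypothetical shapes, chosen only for this example.
rng = np.random.default_rng(0)
T_q, T_c = 50, 40                      # frames per stream (assumed)
d_hubert, d_mfcc, d_model = 768, 40, 64

hubert = rng.standard_normal((T_q, d_hubert))  # stand-in for HuBERT features
mfcc = rng.standard_normal((T_c, d_mfcc))      # stand-in for MFCC features

# Random projections stand in for learned weights.
w_q = rng.standard_normal((d_hubert, d_model)) / np.sqrt(d_hubert)
w_k = rng.standard_normal((d_mfcc, d_model)) / np.sqrt(d_mfcc)
w_v = rng.standard_normal((d_mfcc, d_model)) / np.sqrt(d_mfcc)

fused, attn = cross_attention(hubert, mfcc, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch each query frame receives a weighted mixture of the other stream's frames, so the two streams need not have the same number of frames or the same feature dimension. A full transformer block would typically add multiple heads, residual connections, layer normalization, and a feed-forward layer, none of which are shown here.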

Paper Structure (29 sections, 10 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: General process of SER.
  • Figure 2: Architecture of Cross-Attention Transformer.
  • Figure 3: Structure of proposed HuMP-CAT.
  • Figure 4: Confusion matrix of HuMP-CAT on IEMOCAP corpus.
  • Figure 5: Comparison of three methods on the same target dataset.