Large Language Models Meet Contrastive Learning: Zero-Shot Emotion Recognition Across Languages

Heqing Zou, Fengmao Lv, Desheng Zheng, Eng Siong Chng, Deepu Rajan

TL;DR

This work tackles zero-shot multilingual speech emotion recognition (MSER) by combining contrastive learning with emotion reasoning grounded in a large language model. It introduces a two-stage training framework that first aligns emotion-aware speech features with language features using English data, then refines language-agnostic representations with a synthetic multilingual dataset, M5SER. The approach uses a frozen Whisper encoder, an Emotion Q-Former connector, and LLaMA 3 to predict emotion words. It reports competitive performance on conventional speech emotion recognition (SER) and near-state-of-the-art zero-shot MSER results on unseen languages. Together, M5SER and the demonstrated cross-language generalization support MSER deployment across diverse linguistic contexts.

Abstract

Multilingual speech emotion recognition aims to estimate a speaker's emotional state using a contactless method across different languages. However, variability in voice characteristics and linguistic diversity poses significant challenges for zero-shot speech emotion recognition, especially with multilingual datasets. In this paper, we propose leveraging contrastive learning to refine multilingual speech features and extend large language models for zero-shot multilingual speech emotion estimation. Specifically, we employ a novel two-stage training framework to align speech signals with linguistic features in the emotional space, capturing both emotion-aware and language-agnostic speech representations. To advance research in this field, we introduce a large-scale synthetic multilingual speech emotion dataset, M5SER. Our experiments demonstrate the effectiveness of the proposed method in both speech emotion recognition and zero-shot multilingual speech emotion recognition, including previously unseen datasets and languages.
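
To make the idea of aligning speech signals with linguistic features more concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective over paired speech and text embeddings. It is an illustration under stated assumptions, not the paper's actual loss: the function names, the temperature value of 0.07, and the use of plain NumPy arrays in place of real encoder outputs are all hypothetical.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Row i of speech_emb and row i of text_emb are assumed to form a
    positive pair; every other row in the batch acts as a negative.
    The temperature is an assumed hyperparameter, not a reported value.
    """
    # L2-normalize so that dot products become cosine similarities.
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares speech i with text j.
    logits = (s @ t.T) / temperature
    idx = np.arange(len(s))

    # Speech-to-text direction: softmax over each row.
    loss_s2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-speech direction: softmax over each column.
    loss_t2s = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_s2t + loss_t2s)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.normal(size=(8, 16))                 # stand-in speech features
    text = speech + 0.01 * rng.normal(size=(8, 16))   # nearly aligned text features
    print("aligned pairs:   ", symmetric_contrastive_loss(speech, text))
    print("mismatched pairs:", symmetric_contrastive_loss(speech, np.roll(text, 1, axis=0)))
```

In this toy setup the loss is lower when each speech embedding is paired with its own text embedding than when the pairs are shifted by one position, which is the behavior a contrastive alignment objective is meant to reward.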

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: M5SER: (a) Emotion distribution, (b) Language distribution.
  • Figure 2: Comparison of emotional audio distribution among English and languages in M5SER.
  • Figure 3: Overview of multilingual speech emotion recognition framework.
  • Figure 4: Audio distribution with t-SNE embedding.