Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better

Mengying Ge, Mingyang Li, Dongkai Tang, Pengbo Li, Kuo Liu, Shuhao Deng, Songbai Pu, Long Liu, Yang Song, Tao Zhang

TL;DR

The paper tackles robust multimodal emotion recognition under data scarcity and noisy conditions by proposing an early Audio-Text joint feature extractor built atop a large language model, complemented by strong unimodal encoders (EmotionViT, Chinese HuBERT_large, Baichuan13B_Chat). It combines noise-robust preprocessing (ASR enhancement and MossFormer2 denoising) with a semi-supervised data mining and ensemble strategy that leverages unlabeled data. The approach ranks 2nd on both MER2024-SEMI and MER2024-NOISE, with weighted metrics of around 0.90 and 0.84, respectively, demonstrating effective cross-modal integration and robustness. These findings highlight the practical potential of deep Audio-Text collaboration and semi-supervised learning for real-world emotion understanding, and point to future work on Multimodal Large Language Models to address residual challenges in the visual and Chinese text modalities.

Abstract

In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The joint Audio-Text modal feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we employ speech source separation to preprocess the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
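
To illustrate the multi-model voting used for data mining, the following is a minimal sketch of a single voting round, not the exact procedure used in the paper. It assumes each model outputs one hard label per unlabeled sample and that a sample is kept only when a sufficient fraction of models agree. The function name `vote_pseudo_labels`, the `min_agreement` threshold, and the input layout are hypothetical choices made for this example.

```python
from collections import Counter


def vote_pseudo_labels(predictions, min_agreement=1.0):
    """Select pseudo-labels by voting across models (illustrative sketch).

    predictions: dict mapping model name -> dict of sample id -> predicted label.
    min_agreement: fraction of models that must agree (assumed threshold,
        must be above 0.5 so that the winning label is unambiguous).
    Returns a dict of sample id -> agreed label.
    """
    if not 0.5 < min_agreement <= 1.0:
        raise ValueError("min_agreement must be in (0.5, 1.0]")
    if not predictions:
        return {}

    # Only samples predicted by every model can be voted on.
    shared_ids = set.intersection(*(set(p) for p in predictions.values()))
    n_models = len(predictions)

    selected = {}
    for sample_id in sorted(shared_ids):
        votes = Counter(p[sample_id] for p in predictions.values())
        label, count = votes.most_common(1)[0]
        # Keep the sample only if enough models agree on the same label.
        if count / n_models >= min_agreement:
            selected[sample_id] = label
    return selected


# Hypothetical predictions from three models on three unlabeled clips.
example_predictions = {
    "model_a": {"clip1": "happy", "clip2": "sad", "clip3": "angry"},
    "model_b": {"clip1": "happy", "clip2": "sad", "clip3": "sad"},
    "model_c": {"clip1": "happy", "clip2": "neutral", "clip3": "angry"},
}
unanimous = vote_pseudo_labels(example_predictions)
majority = vote_pseudo_labels(example_predictions, min_agreement=0.6)
```

In a multi-round setting, the selected samples would be added to the training set, the models retrained, and the vote repeated on the remaining unlabeled data. A stricter threshold yields fewer but cleaner pseudo-labels.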

Paper Structure (16 sections, 1 equation, 4 figures, 4 tables)

This paper contains 16 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Multi-Modal Emotion Recognition Framework
  • Figure 3: Illustration of Joint Audio-Text Module
  • Figure 4: MultiModel Ensemble Strategy
  • Figure 5: The trend of model indicators with data mining iteration.