SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition

Zebang Cheng; Shuyuan Tu; Dawei Huang; Minghan Li; Xiaojiang Peng; Zhi-Qi Cheng; Alexander G. Hauptmann

TL;DR

The paper tackles robust multimodal emotion recognition under limited labeled data by combining Emotion-LLaMA-driven pseudo-labeling with a Conv-Attention fusion mechanism. It reports state-of-the-art performance on the MER-NOISE track and strong open-vocabulary results on MER-OV, largely due to synthetic-label augmentation and a hybrid fusion design that blends local semantic detail with global context. Key contributions include a comprehensive feature engineering pipeline across audio, video, and text, an effective pseudo-labeling regime for unlabeled MER data, and a lightweight fusion module that balances convolutional inductive biases with attention. The work improves robustness to noise and enables richer open-vocabulary emotion descriptions, which is relevant for HCI, multimedia analysis, and affective computing.

Abstract

This paper presents our winning approach for the MER-NOISE and MER-OV tracks of the MER2024 Challenge on multimodal emotion recognition. Our system leverages the advanced emotional understanding capabilities of Emotion-LLaMA to generate high-quality annotations for unlabeled samples, addressing the challenge of limited labeled data. To enhance multimodal fusion while mitigating modality-specific noise, we introduce Conv-Attention, a lightweight and efficient hybrid framework. Extensive experimentation validates the effectiveness of our approach. In the MER-NOISE track, our system achieves a state-of-the-art weighted average F-score of 85.30%, surpassing the second and third-place teams by 1.47% and 1.65%, respectively. For the MER-OV track, our utilization of Emotion-LLaMA for open-vocabulary annotation yields an 8.52% improvement in average accuracy and recall compared to GPT-4V, securing the highest score among all participating large multimodal models. The code and model for Emotion-LLaMA are available at https://github.com/ZebangCheng/Emotion-LLaMA.
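The abstract reports results as a weighted average F-score. As a rough illustration of what such a metric computes, the sketch below assumes the usual support-weighted definition (each class's F1 weighted by its share of the true labels); the challenge's exact evaluation script is not given here, and the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += support[label] / total * f1
    return score


# Toy example with made-up labels, not challenge data.
y_true = ["happy", "happy", "sad", "neutral"]
y_pred = ["happy", "sad", "sad", "neutral"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Weighting by support means frequent classes dominate the score, so a system can lose little on this metric even if it does poorly on a rare emotion.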

Paper Structure (26 sections, 7 equations, 2 figures, 10 tables)

Introduction
Related Work
Multimodal Emotion Recognition
Large Models in Emotion Understanding
Methodology
Multimodal Feature Engineering
Auditory Modality
Visual Modality
Textual Modality
Emotion-LLaMA Pseudo-Labeling
Prompt Design and Data Processing
Keyword Extraction and Dataset Augmentation
Multimodal Feature Fusion
Experiments
Single-Modal Performance on Track 2: MER-NOISE
...and 11 more sections

Figures (2)

Figure 1: Overview of the Emotion-LLaMA architecture, which integrates audio, visual, and text inputs for advanced multimodal emotion recognition and reasoning. The model aligns and fuses audio and visual features into a shared semantic space, thereby enhancing the contextual understanding of textual inputs. Emotion-LLaMA leverages multiple visual encoders to capture global, local, and temporal visual aspects, which are then combined with audio and text features to generate detailed emotion descriptions. For further details, refer to the original Emotion-LLaMA paper.
Figure 2: Overview of our framework for MER2024. In the feature extraction phase, frozen encoders extract features from text, video, and audio, which are pooled to integrate multimodal information. In the feature fusion stage, our Conv-Attention mechanism is applied, as detailed in part (b) of the figure. The pre-trained Emotion-LLaMA model generates pseudo-labels, which are combined with original labeled data, enhancing the dataset through augmentation. Finally, the augmented dataset is used to train the Conv-Attention model, boosting the performance and robustness of our emotion recognition system.
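The pseudo-labeling step described in the Figure 2 caption (free-text model output turned into labels and merged with the labeled set) could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the label set `EMOTIONS`, the helper names `extract_label` and `augment`, and the earliest-keyword rule are all assumptions made for the example.

```python
import re

# Assumed discrete label set for illustration; the paper's exact set is not given here.
EMOTIONS = ("neutral", "angry", "happy", "sad", "worried", "surprise")


def extract_label(description, labels=EMOTIONS):
    """Return the label whose keyword appears earliest in the text, or None."""
    text = description.lower()
    best = None  # (position, label) of the earliest match so far
    for label in labels:
        # Word-start match, so "surprised" counts but "unhappy" does not match "happy".
        match = re.search(r"\b" + re.escape(label), text)
        if match and (best is None or match.start() < best[0]):
            best = (match.start(), label)
    return best[1] if best else None


def augment(labeled, descriptions):
    """Merge human labels with pseudo-labels; human labels always win."""
    merged = dict(labeled)
    for sample_id, description in descriptions.items():
        if sample_id in merged:
            continue  # never overwrite an original annotation
        label = extract_label(description)
        if label is not None:  # drop samples with no recognizable keyword
            merged[sample_id] = label
    return merged


# Hypothetical sample IDs and model outputs.
labeled = {"clip_a": "happy"}
descriptions = {
    "clip_a": "An angry tone of voice.",
    "clip_b": "The speaker looks sad and worried.",
    "clip_c": "No clear cue.",
}
print(augment(labeled, descriptions))
```

Discarding descriptions with no matching keyword trades dataset size for label precision; a real system would more likely filter on model confidence or agreement across modalities.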
