Audio-Guided Fusion Techniques for Multimodal Emotion Analysis

Pujin Shi, Fei Gao

TL;DR

The paper addresses multimodal emotion recognition under limited labeled data by combining semi-supervised learning with a robust fusion framework. It introduces Audio-Guided Transformer (AGT) fusion, which uses Hubert-large audio features to guide the visual and textual modalities through a Context-Based Transformer and Adaptive Multimodal Fusion. It also fine-tunes CLIP-vit-large and Baichuan-13B on labeled data and uses high-confidence pseudo-labels to expand the training set. A prior-knowledge voting mechanism mitigates the label imbalance between the training and test sets, improving F1 scores. On the MER-SEMI track of MER2024, the approach achieves a score of 89.83% and ranks third, highlighting the value of audio-guided fusion and semi-supervised strategies for real-world affective computing.

Abstract

In this paper, we propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024. First, to enhance the performance of the feature extractors on sentiment classification tasks, we fine-tuned the video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data. This approach effectively preserves the original emotional information conveyed in the videos. Second, we propose an Audio-Guided Transformer (AGT) fusion mechanism, which leverages the robustness of Hubert-large and shows superior effectiveness in fusing both inter-channel and intra-channel information. Third, to enhance the accuracy of the model, we iteratively apply self-supervised learning, using high-confidence predictions on unlabeled data as pseudo-labels. Finally, through black-box probing, we discovered an imbalanced data distribution between the training and test sets. We therefore adopt a prior-knowledge-based voting mechanism. The results demonstrate the effectiveness of our strategy, ultimately earning us third place in the MER-SEMI track.
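
The pseudo-labeling step in the abstract can be illustrated with a minimal sketch. The abstract does not give the confidence criterion, so the function name `select_pseudo_labels`, the threshold of 0.9, and the toy probabilities below are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value; the source does not state one
) -> List[Tuple[int, int]]:
    """Return (sample_index, predicted_class) for confident predictions.

    `probs` holds one class-probability vector per unlabeled sample.
    A sample is kept only if its highest probability reaches `threshold`.
    """
    selected = []
    for i, p in enumerate(probs):
        best = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if p[best] >= threshold:
            selected.append((i, best))
    return selected


# Toy predictions for three unlabeled samples over three classes.
probs = [
    [0.95, 0.03, 0.02],  # confident -> kept with label 0
    [0.40, 0.35, 0.25],  # uncertain -> discarded
    [0.05, 0.05, 0.90],  # confident -> kept with label 2
]
pseudo = select_pseudo_labels(probs, threshold=0.9)
print(pseudo)  # [(0, 0), (2, 2)]
```

In an iterative scheme, the selected samples would be added to the training set, the model retrained, and the selection repeated on the remaining unlabeled data.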

Paper Structure (7 sections, 2 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Overview of our multimodal emotional feature fusion framework. (CBT: Context-based Transformer. AMF: Adaptive Multimodal Fusion. Stage 1: Train using labeled data. Stage 2: Generate pseudo-labels using unlabeled data and add them back to the training set for further training.)
  • Figure 2: Fine-tuning the feature extractors for both visual and textual modalities
  • Figure 3: Distribution of the training and testing sets across the six emotional labels
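
The abstract's prior-knowledge-based voting mechanism is motivated by the train/test distribution gap that Figure 3 depicts, but the voting rule itself is not given here. One plausible form, sketched below purely as an assumption, averages the class probabilities of several models and reweights them by the ratio of an assumed test prior to the training prior before taking the argmax. The function `prior_adjusted_vote` and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def prior_adjusted_vote(
    model_probs: Sequence[Sequence[float]],
    train_prior: Sequence[float],
    test_prior: Sequence[float],
) -> Tuple[int, List[float]]:
    """Soft-vote over models, then correct for a shift in class priors.

    This is a generic label-shift correction, not the paper's exact rule:
    averaged probabilities are scaled by test_prior / train_prior and
    renormalized so they sum to 1.
    """
    n_classes = len(train_prior)
    n_models = len(model_probs)
    # Soft voting: mean probability per class across models.
    avg = [sum(m[c] for m in model_probs) / n_models for c in range(n_classes)]
    # Reweight by the assumed change in class frequency.
    adjusted = [avg[c] * test_prior[c] / train_prior[c] for c in range(n_classes)]
    total = sum(adjusted)
    adjusted = [a / total for a in adjusted]
    label = max(range(n_classes), key=adjusted.__getitem__)
    return label, adjusted


# Two hypothetical models scoring one sample over three classes.
model_probs = [[0.50, 0.40, 0.10], [0.45, 0.45, 0.10]]
train_prior = [0.6, 0.2, 0.2]  # made-up training class frequencies
test_prior = [0.3, 0.5, 0.2]   # made-up estimate of test class frequencies

label, scores = prior_adjusted_vote(model_probs, train_prior, test_prior)
print(label)  # 1: class 1 wins once its higher test prior is accounted for
```

With identical training and test priors the reweighting has no effect and the plain soft vote (class 0 in this toy case) is returned.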