Fitting Different Interactive Information: Joint Classification of Emotion and Intention
Xinger Li, Zhiqiang Zhong, Bo Huang, Yang Yang
TL;DR
The study tackles joint emotion and intention recognition under limited labeled data. It combines pseudo-labeling, in which only unlabeled samples predicted with high confidence ($>0.99$) are kept, with a cross-task fusion framework that uses multi-head self-attention and a gated interaction across visual, audio, and Chinese text modalities. The key findings are that intention is more readily captured by emotional cues and that distinct attention heads optimize the emotion and intention pathways, yielding a final score of $0.5532$ on the MEIJU25 Track I test set. The approach demonstrates effective use of unlabeled data and inter-task interaction to achieve strong results in low-resource multimodal emotion and intention recognition, with practical implications for real-time video analysis and human-computer interaction.
Abstract
This paper presents the first-place solution for ICASSP MEIJU@2025 Track I, which focuses on low-resource multimodal emotion and intention recognition. Two questions are key to the competition: how to make effective use of a large amount of unlabeled data, and how to ensure that tasks of differing difficulty promote each other in the interaction stage. We use a model trained on the labeled data to pseudo-label the unlabeled data, and select high-confidence samples together with their predicted labels to alleviate the low-resource problem. In addition, our experiments show that intention recognition is the easier task to represent; we exploit this property so that intention recognition and emotion recognition promote each other under different attention heads, and we achieve higher intention recognition performance through fusion. Finally, with the refined data, we achieve a score of 0.5532 on the test set and win the track.
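The high-confidence selection step can be illustrated with a minimal sketch. It assumes the model trained on labeled data outputs one probability vector per unlabeled sample, and it uses the $>0.99$ threshold stated in the TL;DR. The function name `select_pseudo_labels` and the toy probabilities are hypothetical and are not taken from the authors' code.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probabilities: Sequence[Sequence[float]],
    threshold: float = 0.99,
) -> List[Tuple[int, int]]:
    """Return (sample_index, predicted_label) pairs for confident predictions.

    A sample is kept only if its highest class probability is strictly
    greater than `threshold` (assumed here to be the 0.99 cutoff).
    """
    selected = []
    for index, probs in enumerate(probabilities):
        # Predicted label is the class with the highest probability.
        label = max(range(len(probs)), key=lambda k: probs[k])
        if probs[label] > threshold:
            selected.append((index, label))
    return selected


# Toy predictions for three unlabeled samples over three classes (made up).
unlabeled_probs = [
    [0.995, 0.003, 0.002],  # confident: kept with label 0
    [0.600, 0.300, 0.100],  # not confident: discarded
    [0.001, 0.998, 0.001],  # confident: kept with label 1
]

pseudo_labeled = select_pseudo_labels(unlabeled_probs)
print(pseudo_labeled)  # [(0, 0), (2, 1)]
```

In a setup like this, the selected samples and their predicted labels would be added to the labeled training set before retraining. Whether the selection is repeated over several rounds is not stated in the abstract.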
