Solution for Emotion Prediction Competition of Workshop on Emotionally and Culturally Intelligent AI
Shengdong Xu, Zhouyang Chi, Yang Yang
TL;DR
This work tackles multimodal emotion prediction across languages and cultures on the ArtELingo dataset, addressing modal imbalance and language-cultural differences. It combines XLM-R–based unimodal signals with a modified X$^2$-VLM multimodal backbone under an Emotion-Cultural specific prompt (ECSP), augmented by retrieval-based pseudo-labels and test-time augmentation. Empirically, the ECSP-enabled ensemble achieves a final F1 score of $0.627$ and an accuracy of $0.741$, outperforming baselines and ablated variants, with results indicating that the text modality carries more predictive power for emotion than the image alone. The approach demonstrates a practical path toward more culturally aware, multilingual multimodal emotion understanding, suggesting broader applicability in cross-lingual AI systems.
Abstract
This report provides a detailed description of the method we explored and proposed for the WECIA Emotion Prediction Competition (EPC), in which the task is to predict a person's emotion from an artwork and an accompanying comment. The competition dataset is ArtELingo, which is designed to encourage work on diversity across languages and cultures. The dataset poses two main challenges: modal imbalance and language-cultural differences. To address these challenges, we propose a simple yet effective approach called single-multi modal with Emotion-Cultural specific prompt (ECSP), which uses unimodal information to enhance the performance of multimodal models and a well-designed prompt to reduce the effect of cultural differences. Our approach contains two main blocks: (1) an XLM-R\cite{conneau2019unsupervised} based unimodal model and an X$^2$-VLM\cite{zeng2022x} based multimodal model, and (2) the Emotion-Cultural specific prompt. Our approach ranked first in the final test with an F1 score of 0.627.
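The idea of letting a unimodal (text-only) model support a multimodal model can be illustrated with a simple late-fusion sketch. This is only an illustration under stated assumptions: the abstract does not specify how the two blocks are combined, so the weighted average below, the function names `fuse_probabilities` and `predict_label`, the `text_weight` parameter, the three-label subset, and all probability values are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical label subset, used for illustration only.
LABELS = ["contentment", "awe", "sadness"]


def fuse_probabilities(
    text_probs: Sequence[float],
    multimodal_probs: Sequence[float],
    text_weight: float = 0.5,
) -> List[float]:
    """Return a weighted average of two per-class probability vectors.

    text_weight is an assumed hyperparameter in [0, 1]; the remaining
    weight (1 - text_weight) goes to the multimodal model.
    """
    if len(text_probs) != len(multimodal_probs):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return [
        text_weight * t + (1.0 - text_weight) * m
        for t, m in zip(text_probs, multimodal_probs)
    ]


def predict_label(probs: Sequence[float], labels: Sequence[str]) -> str:
    """Pick the label with the highest fused probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best_index]


# Made-up outputs of a text-only model and a multimodal model for one sample.
text_probs = [0.2, 0.5, 0.3]
multimodal_probs = [0.6, 0.3, 0.1]

fused = fuse_probabilities(text_probs, multimodal_probs, text_weight=0.6)
prediction = predict_label(fused, LABELS)
print(prediction)
```

In this toy case the multimodal model alone would choose the first label, while the fused scores favor the second because the text model is weighted more heavily. In practice such a weight would be tuned on validation data.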
