Solution for Emotion Prediction Competition of Workshop on Emotionally and Culturally Intelligent AI

Shengdong Xu, Zhouyang Chi, Yang Yang

TL;DR

This work tackles multimodal emotion prediction across languages and cultures using ArtELingo, addressing modal imbalance and language-cultural differences. It combines XLM-R–based unimodal signals with a modified X$^2$-VLM multimodal backbone under an Emotion-Cultural specific prompt (ECSP), augmented by retrieval-based pseudo-labels and test-time augmentation. Empirically, the ECSP-enabled ensemble achieves a final F1 score of $0.627$ and an accuracy of $0.741$, outperforming baselines and ablated variants, with results indicating that text modality carries more predictive power for emotion than image alone. The approach demonstrates a practical path toward more culturally aware, multilingual multimodal emotion understanding, suggesting broader applicability in cross-lingual AI systems.

Abstract

This report provides a detailed description of the method that we explored and proposed in the WECIA Emotion Prediction Competition (EPC), which predicts a person's emotion from an artistic work and an accompanying comment. The dataset of this competition is ArtELingo, designed to encourage work on diversity across languages and cultures. The dataset poses two main challenges: modal imbalance and language-cultural differences. To address these challenges, we propose a simple yet effective approach called single-multi modal with Emotion-Cultural specific prompt (ECSP), which uses single-modal information to enhance the performance of multimodal models and a well-designed prompt to reduce the impact of cultural differences. Our approach contains two main blocks: (1) an XLM-R\cite{conneau2019unsupervised} based unimodal model and an X$^2$-VLM\cite{zeng2022x} based multimodal model, and (2) the Emotion-Cultural specific prompt. Our approach ranked first in the final test with a score of 0.627.

Paper Structure (12 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: ArtELingo, a multilingual dataset and benchmark built on WikiArt with captions and emotions.
  • Figure 2: Outline of our proposed method. The Pseudo Labels Retrieval Module is shown in Figure 3. The construction of the Emotion-Cultural specific prompted text is described in the main text.
  • Figure 3: Pseudo Labels Retrieval Module. First, the text encoder and image encoder \cite{radford2021learning} produce embeddings of the text and the image, which are concatenated. Then the cosine similarity between the current sample's embedding and the embeddings of the other samples is computed and compared against a threshold, and the labels of the top-k samples are taken as pseudo labels.
  • Figure 4: Test-time augmentation. The original image is passed through four transformations, and each transformed image is input to the model together with the text.
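
The retrieval step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the text and image embeddings have already been computed by some encoder, and the function name `retrieve_pseudo_labels`, the parameters `k` and `threshold`, and their default values are all hypothetical. It also assumes the threshold is applied to the top-k neighbours, which the caption does not state explicitly.

```python
import numpy as np

def retrieve_pseudo_labels(text_emb, image_emb, labels, query_idx, k=3, threshold=0.9):
    """Return labels of the top-k neighbours of one sample that pass the threshold.

    text_emb, image_emb: arrays of shape (n_samples, dim), assumed precomputed.
    labels: list of n_samples labels. k and threshold are illustrative values.
    """
    # Concatenate the text and image embeddings of each sample.
    joint = np.concatenate([text_emb, image_emb], axis=1).astype(float)
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(joint, axis=1, keepdims=True)
    joint = joint / np.clip(norms, 1e-12, None)
    # Cosine similarity between the query sample and every sample.
    sims = joint @ joint[query_idx]
    sims[query_idx] = -np.inf  # never retrieve the query itself
    # Take the k most similar samples, then keep those above the threshold.
    top_k = np.argsort(-sims)[:k]
    return [labels[i] for i in top_k if sims[i] > threshold]

# Toy example with made-up 2-dimensional embeddings.
text_emb = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.9, 0.0]])
image_emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.1]])
labels = ["awe", "awe", "fear", "contentment"]
print(retrieve_pseudo_labels(text_emb, image_emb, labels, query_idx=0))
```

In this toy example, sample 0 retrieves the labels of samples 1 and 3, while sample 2 falls below the threshold and contributes no pseudo label.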