University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Ikhlasul Akmal Hanif, Eryawan Presma Yulianrifat, Jaycent Gunawan Ongris, Eduardus Tjitrahardja, Muhammad Falensi Azmi, Rahmat Bryan Naufal, Alfan Farizki Wicaksono

TL;DR

This work tackles multilingual multi-label emotion detection across 28 languages in SemEval-2025 Task 11 Track A by comparing classifier-only training with end-to-end fine-tuning of encoder-based models. It finds that embedding-based methods using prompt-based encoders (notably BGE and mE5) coupled with a tree-based classifier (CatBoost) outperform fully fine-tuned transformers, and that ensembling improves results further: the best ensemble reaches an average F1-macro of 56.58 across languages. The study highlights that high-quality multilingual embeddings, paired with efficient classifiers and careful prompt engineering, offer a practical and scalable approach to emotion detection across diverse languages. It also finds that multilingual training does not consistently beat language-specific training and that prompt design significantly influences results. Overall, the results favor strong embeddings with lightweight classifiers over full transformer fine-tuning for multilingual multi-label emotion classification, at reduced computational cost.
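The summary does not say how the ensemble combines its member models. As a minimal sketch, assuming soft voting (averaging per-label probabilities and thresholding), the combination step might look like the following. The function name `soft_vote`, the 0.5 threshold, and the toy probabilities are illustrative assumptions, not details from the paper.

```python
from typing import List

# One model's output: probs[i][j] = probability that sample i has label j.
Probs = List[List[float]]


def soft_vote(prob_sets: List[Probs], threshold: float = 0.5) -> List[List[int]]:
    """Average per-label probabilities over models, then threshold.

    Assumption: soft voting with a fixed threshold; the paper's actual
    ensembling rule is not given in this summary.
    """
    n_models = len(prob_sets)
    n_samples = len(prob_sets[0])
    n_labels = len(prob_sets[0][0])
    predictions = []
    for i in range(n_samples):
        row = []
        for j in range(n_labels):
            mean_p = sum(model[i][j] for model in prob_sets) / n_models
            row.append(1 if mean_p >= threshold else 0)
        predictions.append(row)
    return predictions


# Hypothetical outputs of two classifiers for 2 samples and 2 emotion labels.
model_a = [[0.9, 0.2], [0.4, 0.6]]
model_b = [[0.7, 0.4], [0.2, 0.3]]
ensemble_pred = soft_vote([model_a, model_b])
```

Each label is decided independently, which matches the multi-label setting: a text can receive zero, one, or several emotion labels.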

Abstract

This paper presents our approach for SemEval-2025 Task 11 Track A, focusing on multi-label emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLM-R and mBERT. Our best-performing model on the final leaderboard is an ensemble of multiple BGE models in different configurations, each using CatBoost as the classifier. This ensemble achieves an average F1-macro score of 56.58 across all languages.
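For readers unfamiliar with the reported metric, the following is a minimal sketch of F1-macro for multi-label predictions and of its average over languages. It assumes the standard definition (per-label F1, averaged with equal weight over labels) and assumes that a label with no true and no predicted positives scores 0; the function names and toy data are illustrative, not taken from the paper or the task's scorer.

```python
from typing import Dict, List

Labels = List[List[int]]  # rows = samples, columns = binary emotion labels


def f1_for_label(y_true: Labels, y_pred: Labels, j: int) -> float:
    """F1 for a single label column j."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p[j] == 1 and t[j] == 1:
            tp += 1
        elif p[j] == 1 and t[j] == 0:
            fp += 1
        elif p[j] == 0 and t[j] == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # Assumption: an undefined F1 (no positives at all) counts as 0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Labels, y_pred: Labels) -> float:
    """Unweighted mean of per-label F1 scores."""
    n_labels = len(y_true[0])
    return sum(f1_for_label(y_true, y_pred, j) for j in range(n_labels)) / n_labels


def average_over_languages(scores: Dict[str, float]) -> float:
    """Unweighted mean of per-language macro-F1 scores."""
    return sum(scores.values()) / len(scores)


# Toy example with 3 samples and 2 labels.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]
score = macro_f1(y_true, y_pred)
```

Because every label and every language carries equal weight, rare emotions and low-resource languages affect the final average as much as frequent ones.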

Paper Structure

This paper contains 22 sections, 4 equations, 1 figure, 13 tables.

Figures (1)

  • Figure 1: Our system overview