Evaluating the Capabilities of Large Language Models for Multi-label Emotion Understanding

Tadesse Destaw Belay; Israel Abebe Azime; Abinew Ali Ayele; Grigori Sidorov; Dietrich Klakow; Philipp Slusallek; Olga Kolesnikova; Seid Muhie Yimam

TL;DR

This paper introduces EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages (Amharic, Afan Oromo, Somali, and Tigrinya), and evaluates encoder-only, encoder-decoder, and decoder-only models on it alongside an English multi-label emotion dataset from SemEval 2018 Task 1. Fine-tuned Afri-centric encoder-only models (notably AfroXLMR-76L) perform best across languages, while zero-shot and few-shot large language models offer limited gains, especially for low-resource languages. The work also covers diverse data sources, annotation with inter-annotator agreement analysis, translation experiments, and in-context learning studies, which together highlight the persistent challenges in multi-label emotion classification and the impact of data, prompts, and model type. The EthioEmo benchmark, along with its lexicons and annotation guidelines, is a resource for advancing cross-lingual emotion understanding and evaluating future models in low-resource, multilingual settings.

Abstract

Large Language Models (LLMs) show promising learning and reasoning abilities. Compared to other NLP tasks, multilingual and multi-label emotion evaluation tasks are under-explored for LLMs. In this paper, we present EthioEmo, a multi-label emotion classification dataset for four Ethiopian languages, namely Amharic (amh), Afan Oromo (orm), Somali (som), and Tigrinya (tir). We perform extensive experiments that also include an English multi-label emotion dataset from SemEval 2018 Task 1. Our evaluation covers encoder-only, encoder-decoder, and decoder-only language models. We compare zero-shot and few-shot LLM approaches with fine-tuning smaller language models. The results show that the accuracy of multi-label emotion classification is still insufficient even for high-resource languages such as English, and that there is a large performance gap between high-resource and low-resource languages. The results also show that performance varies with the language and model type. EthioEmo is publicly available to further improve the understanding of emotions in language models and of how people convey emotions through various languages.
