CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding

Tadesse Destaw Belay, Ahmed Haj Ahmed, Alvin Grissom, Iqra Ameer, Grigori Sidorov, Olga Kolesnikova, Seid Muhie Yimam

TL;DR

CuLEmo introduces a culturally rich, multilingual benchmark to evaluate emotion understanding beyond keyword cues, using native annotations across six languages. The study systematically tests a range of LLMs with English and in-language prompts, including explicit country context, to reveal how culture and language shape emotion and sentiment predictions. Key findings show substantial cross-cultural variation, prompt-language effects, and underrepresentation gaps that country-context prompts can mitigate, while larger improvements require richer cultural data and targeted tuning. The work provides a practical, open-access resource and guidance for building more culturally aware NLP systems with careful ethical considerations and ongoing community-driven enhancement.

Abstract

NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLM performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are publicly available.
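
To make finding (3) concrete, the sketch below shows one way an English prompt with and without explicit country context could be built and scored. It is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `build_prompt` and `accuracy`, the prompt wording, and the example label list are all hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def build_prompt(event: str, labels: Sequence[str],
                 country: Optional[str] = None) -> str:
    """Build an English prompt for one event.

    Hypothetical template: the wording is an assumption, not the
    benchmark's actual prompt. When `country` is given, the prompt
    names it explicitly (the "country context" condition).
    """
    who = f"a person from {country}" if country else "a person"
    options = ", ".join(labels)
    return (
        f"How would {who} feel in the following situation?\n"
        f"Situation: {event}\n"
        f"Answer with exactly one of: {options}."
    )


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Illustrative label list only; the real label set comes from the dataset.
example_labels = ["joy", "sadness", "anger"]

plain = build_prompt("Someone arrives late to a family dinner.", example_labels)
with_country = build_prompt(
    "Someone arrives late to a family dinner.", example_labels, country="Ethiopia"
)
```

Comparing model accuracy on `plain` versus `with_country` prompts, per country, is one way to measure how much explicit country context helps. An in-language condition would use the same structure with the template translated into the target language.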

Paper Structure

This paper contains 46 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: CuLEmo dataset creation pipeline and evaluations of LLMs in emotion and sentiment tasks.
  • Figure 2: Emotion label distribution across countries/languages: the number of instances per emotion label in each language, out of 400 events in total.
  • Figure 3: Pairwise emotion label agreements across countries/languages (ordered by their average agreement). Abbreviations: US = USA, MX = Mexico, DE = Germany, AE = UAE, ET = Ethiopia, and IN = India.
  • Figure 4: Radar chart of emotion prediction accuracy across countries for English and in-language prompts. For lower-resource languages, English prompts tend to work substantially better.
  • Figure 5: Sentiment (positive, negative, and neutral) distribution across countries in the CuLEmo dataset.
  • ...and 2 more figures