CULEMO: Cultural Lenses on Emotion -- Benchmarking LLMs for Cross-Cultural Emotion Understanding
Tadesse Destaw Belay, Ahmed Haj Ahmed, Alvin Grissom, Iqra Ameer, Grigori Sidorov, Olga Kolesnikova, Seid Muhie Yimam
TL;DR
CuLEmo introduces a culturally rich, multilingual benchmark to evaluate emotion understanding beyond keyword cues, using native annotations across six languages. The study systematically tests a range of LLMs with English and in-language prompts, including explicit country context, to reveal how culture and language shape emotion and sentiment predictions. Key findings show substantial cross-cultural variation, prompt-language effects, and underrepresentation gaps that country-context prompts can mitigate, while larger improvements require richer cultural data and targeted tuning. The work provides a practical, open-access resource and guidance for building more culturally aware NLP systems with careful ethical considerations and ongoing community-driven enhancement.
Abstract
NLP research has increasingly focused on subjective tasks such as emotion analysis. However, existing emotion benchmarks suffer from two major shortcomings: (1) they largely rely on keyword-based emotion recognition, overlooking crucial cultural dimensions required for deeper emotion understanding, and (2) many are created by translating English-annotated data into other languages, leading to potentially unreliable evaluation. To address these issues, we introduce Cultural Lenses on Emotion (CuLEmo), the first benchmark designed to evaluate culture-aware emotion prediction across six languages: Amharic, Arabic, English, German, Hindi, and Spanish. CuLEmo comprises 400 crafted questions per language, each requiring nuanced cultural reasoning and understanding. We use this benchmark to evaluate several state-of-the-art LLMs on culture-aware emotion prediction and sentiment analysis tasks. Our findings reveal that (1) emotion conceptualizations vary significantly across languages and cultures, (2) LLM performance likewise varies by language and cultural context, and (3) prompting in English with explicit country context often outperforms in-language prompts for culture-aware emotion and sentiment understanding. The dataset and evaluation code are publicly available.
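
To make the third finding concrete, the following is a minimal sketch of how an English prompt with and without explicit country context might be constructed. The function name `build_prompt`, the prompt wording, the label set, and the example question are all hypothetical placeholders chosen for illustration; they are not the templates, labels, or data used in CuLEmo.

```python
from typing import Optional, Sequence


def build_prompt(question: str, labels: Sequence[str], country: Optional[str] = None) -> str:
    """Return an English prompt, optionally prefixed with explicit country context."""
    # Hypothetical wording: the benchmark's actual prompt templates are not reproduced here.
    context = f"You live in {country}. " if country else ""
    options = ", ".join(labels)
    return (
        f"{context}Answer with exactly one emotion from: {options}.\n"
        f"Question: {question}\n"
        "Emotion:"
    )


# Illustrative label set and question (assumptions, not taken from the dataset).
LABELS = ["joy", "sadness", "anger", "fear", "neutral"]
QUESTION = "How would you feel if a guest arrived an hour late to dinner?"

plain = build_prompt(QUESTION, LABELS)                             # no cultural context
with_country = build_prompt(QUESTION, LABELS, country="Ethiopia")  # explicit country context

print(with_country)
```

Sending both variants to the same model and comparing the predicted labels against native annotations would be one way to measure the effect of country context under these assumptions.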
