Camellia: Benchmarking Cultural Biases in LLMs for Asian Languages

Tarek Naous, Anagha Savit, Carlos Rafael Catalan, Geyang Guo, Jaehyeok Lee, Kyungdon Lee, Lheane Marie Dizon, Mengyu Ye, Neel Kothari, Sahajpreet Singh, Sarah Masud, Tanish Patwa, Trung Thanh Tran, Zohaib Khan, Alan Ritter, JinYeong Bak, Keisuke Sakaguchi, Tanmoy Chakraborty, Yuki Arase, Wei Xu

TL;DR

This paper tackles cultural biases in multilingual LLMs by introducing Camellia, a benchmark spanning nine Asian languages and six cultures. Camellia comprises 19,530 culturally annotated entities and 2,173 natural masked contexts, enabling evaluation across cultural adaptation, sentiment association, and extractive QA for four multilingual LLM families. Results reveal persistent Western bias in cultural adaptation, diverse sentiment associations by model family, and notable context-understanding gaps in non-English languages, with English translations mitigating some gaps. The work provides a comprehensive resource to measure and mitigate cultural bias in multilingual LLMs and emphasizes how data provenance and language resource availability shape cross-language cultural competence.

Abstract

As Large Language Models (LLMs) gain stronger multilingual capabilities, their ability to handle culturally diverse entities becomes crucial. Prior work has shown that LLMs often favor Western-associated entities in Arabic, raising concerns about cultural fairness. Due to the lack of multilingual benchmarks, it remains unclear whether such biases also manifest in other non-Western languages. In this paper, we introduce Camellia, a benchmark for measuring entity-centric cultural biases in nine Asian languages spanning six distinct Asian cultures. Camellia includes 19,530 entities manually annotated for association with either the specific Asian culture or Western culture, as well as 2,173 naturally occurring masked contexts for entities derived from social media posts. Using Camellia, we evaluate cultural biases in four recent multilingual LLM families across tasks such as cultural context adaptation, sentiment association, and entity extractive QA. Our analyses show that LLMs struggle with cultural adaptation in all Asian languages, with performance differing across models developed in regions with varying access to culturally relevant data. We further observe that different LLM families hold distinct biases, differing in how they associate cultures with particular sentiments. Lastly, we find that LLMs struggle with context understanding in Asian languages, creating performance gaps between cultures in entity extraction.

Paper Structure

This paper contains 53 sections, 1 equation, 10 figures, 11 tables.

Figures (10)

  • Figure 1: We construct Camellia, a benchmark to measure cultural biases for six Asian cultures, covering nine languages. Camellia provides 2,173 naturally-occurring masked contexts categorized into: culturally-grounded, culturally-neutral, and extractive QA. Camellia also provides 19,530 culturally relevant entities that contrast the respective Asian cultures vs. Western culture across six different entity types that exhibit cultural variation. The masked contexts and entities in Camellia enable the measurement of cultural biases in LLMs via versatile task setups.
  • Figure 2: Example per entity type and statistics of respective Asian entities per culture and Western entities in Camellia. Western entities are parallel for all 9 languages while Indian entities are parallel in all Indian languages (see the subsection on entity collection). Camellia also provides an English translation for each entity.
  • Figure 3: Average Cultural Bias Score (CBS) ($\downarrow$) across entity types achieved by LLMs on culturally-grounded contexts (Camellia-Grounded) for each Asian language. LLMs can struggle to generate the appropriate Asian entities in each culture, assigning better likelihood to Western entities 30-40% of the time. See results per entity type in the appendix on cultural adaptation results.
  • Figure 4: Average CBS across entity types on culturally-grounded contexts vs culturally-neutral contexts. LLMs show more preference towards Western entities in culturally-neutral contexts (higher CBS). CBS scores are lower in culturally-grounded contexts, yet remain close to the neutral case.
  • Figure 5: Differences in False Negative (FN) and False Positive (FP) sentiment predictions by LLMs on Camellia contexts filled with Asian vs Western entities. Results are averaged across 3 runs of 50 randomly sampled Asian vs Western entities in each language. Llama and Gemma tend to associate Western entities with negativity, while Qwen and Aya tend to associate Asian entities with positivity.
  • ...and 5 more figures
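
The Figure 3 caption describes CBS in terms of how often an LLM assigns a better likelihood to Western entities than to Asian ones. The sketch below illustrates that reading under stated assumptions: the function name `cultural_bias_score`, the all-pairs comparison, and the sample log-likelihood values are hypothetical and are not taken from the paper, whose exact equation is not reproduced in this summary.

```python
from itertools import product


def cultural_bias_score(asian_scores, western_scores):
    """Percentage of (Asian, Western) entity pairs in which the Western
    entity receives the higher score for the same masked context.

    Assumption: scores are model log-likelihoods of the context filled
    with each entity; higher means the model prefers that entity.
    """
    pairs = list(product(asian_scores, western_scores))
    if not pairs:
        raise ValueError("need at least one Asian and one Western score")
    western_preferred = sum(1 for a, w in pairs if w > a)
    return 100.0 * western_preferred / len(pairs)


# Hypothetical log-likelihoods for one masked context (illustrative only).
asian = [-4.0, -6.0]
western = [-5.0, -7.0]
cbs = cultural_bias_score(asian, western)
print(f"CBS = {cbs:.1f}%")
```

Under this reading, a lower score means the model prefers the culturally appropriate Asian entities more often, which matches the downward arrow in the Figure 3 caption.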