On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena

Tarek Naous, Wei Xu

TL;DR

This work investigates Western cultural bias in multilingual language models by examining both how entities are represented in pre-training data and how linguistic phenomena vary across languages. It introduces CAMeL-2, a parallel Arabic–English benchmark with 58,086 culturally associated entities and 367 natural contexts, designed to probe extractive QA and NER across cultures. Through analyses of entity frequency, Arabic polysemy, script overlaps, and subword tokenization, the paper shows that biases intensify in Arabic due to polysemy and lexical overlap with other Arabic-script languages, and that frequency-based tokenization worsens recognition of Arab entities, an effect that grows with larger Arabic vocabularies. Performance gaps between cultures are smaller in English than in Arabic, highlighting the need for culturally informed data curation and tokenization strategies to build multilingual LMs that are fairer to non-Western cultures.

Abstract

Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal that LMs show smaller performance gaps between cultures when tested in English than in Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2

Paper Structure (59 sections, 1 equation, 18 figures, 10 tables)

Figures (18)

  • Figure 1: Responses of an LM tasked to extract the food dish from the same text in English and Arabic. The LM identifies the Arab dish "Makloube" in English, but fails in Arabic, where the word "Makloube" holds two senses. The LM does not struggle with the Western dish "Lasagna", which holds only one sense in both languages.
  • Figure 2: Cultural Bias Score ($\downarrow$) per entity type on culturally-grounded contexts from CAMeL-2 (see the text-infilling task subsection). LMs can adapt better to Arab culture when tested in English.
  • Figure 3: Average QA Accuracy ($\uparrow$) of LLMs when tested in Arabic and English on location, name, food, and beverage entities associated with Arab culture, stratified by their occurrence counts in the mC4 corpus (grouped into log10-spaced bins; see the corpus frequencies subsection). Gray bars in the background represent the number of entities tested in each bin. Interestingly, LMs struggle with very high-frequency entities in Arabic.
  • Figure 4: QA accuracy ($\uparrow$) for different sizes of LMs on high-frequency polysemous and non-polysemous Arab location entities. We find a positive scaling trend for all models, with lower performance on polysemous entities.
  • Figure 5: Average QA Accuracy ($\uparrow$) of Llama3.3-70b on the top-100 most frequent location entities in mC4 for each Arab and Western country in CAMeL-2 (see the entity polysemy analysis subsection). Arab countries are grouped by the language family that influences location naming in their region. Performance on Arab locations decreases as the percentage of entities that are Arabic polysemous words increases. Performance in English on the same entities is shown as a gray background.
  • ...and 13 more figures
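
The log10-spaced binning mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper name `accuracy_by_log10_bin`, the input format (pairs of corpus occurrence count and a correctness flag), and the toy records are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_log10_bin(records):
    """Group (count, correct) pairs into log10-spaced bins.

    Bin k holds entities whose occurrence count c satisfies
    10**k <= c < 10**(k + 1). Returns {k: (num_entities, mean_accuracy)}.
    Hypothetical helper; entities with a count of 0 are skipped because
    log10(0) is undefined.
    """
    bins = defaultdict(list)
    for count, correct in records:
        if count <= 0:
            continue
        # For a positive integer, (number of digits - 1) equals
        # floor(log10(count)) and avoids floating-point rounding.
        k = len(str(int(count))) - 1
        bins[k].append(1.0 if correct else 0.0)
    return {k: (len(v), sum(v) / len(v)) for k, v in sorted(bins.items())}


# Toy data (made up for illustration): (occurrence count, answered correctly?)
toy_records = [
    (5, True),
    (12, True),
    (80, False),
    (1500, True),
    (9999, False),
    (10000, False),
]
per_bin = accuracy_by_log10_bin(toy_records)
```

Each key of `per_bin` is the exponent of a bin's lower edge, and each value pairs the number of entities in that bin (what the caption's gray bars represent) with the mean accuracy over them.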