CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

Huihan Li; Liwei Jiang; Jena D. Hwang; Hyunwoo Kim; Sebastin Santy; Taylor Sorensen; Bill Yuchen Lin; Nouha Dziri; Xiang Ren; Yejin Choi

TL;DR

Culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures, and LLMs show an uneven degree of diversity in the culture symbols they generate.

Abstract

As the use of large language models (LLMs) has spread worldwide, it is crucial for them to have adequate knowledge and fair representation of diverse global cultures. In this work, we uncover the culture perceptions of three SOTA models on 110 countries and regions across 8 culture-related topics through culture-conditioned generations, and extract the symbols that the LLM associates with each culture from these generations. We discover that culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures. We also discover that LLMs have an uneven degree of diversity in the culture symbols, and that cultures from different geographic regions have different presence in LLMs' culture-agnostic generation. Our findings promote further research on the knowledge and fairness of global culture perception in LLMs. Code and data can be found here: https://github.com/huihanlhh/Culture-Gen/

Paper Structure (42 sections, 10 figures, 11 tables)

This paper contains 42 sections, 10 figures, 11 tables.

Introduction
Related Work
Collection of Culture-Gen
Countries and regions as a culture.
Prompting on culture-related topics.
Generative language models.
Finding Culture Symbols in Culture-Gen
Culture Symbols: concepts of a culture.
Step 1: Extracting candidate symbols from Culture-Gen generations.
Assigning Candidate Symbols to a Culture.
Statistics about Culture Symbols.
Ablation on the representativeness of Culture Symbols across demographics - age and gender.
LLM Global Culture Perception Analysis
Marked Cultures: a process of "othering" marginalized cultures from default cultures.
Markedness.
...and 27 more sections

Figures (10)

Figure 1: We construct Culture-Gen, a dataset of generations on 8 culture-related topics for 110 countries and regions, using gpt-4, llama2-13b, and mistral-7b. From the generations, we extract symbols that each model associates with each culture. Using Culture-Gen, we then examine the generations for culture-distinguishing markers, and evaluate the diversity of cultural symbols and LM preferences for cultural symbols in culture-agnostic generations.
Figure 2: Different markedness for each geographic region by mistral-7b. Central-Asia, Middle-East and East-Asia show the highest markedness among all geographic regions.
Figure 3: Generations for African and Asian cultures have the most vocabulary markers.
Figure 4: Teal: Number of diverse culture symbols. Salmon: culture-topic co-occurrence in RedPajama (axis starts from the top). For llama2-13b, higher topic-related keyword co-occurrence corresponds to less diverse cultural values ($\tau=-0.30$). For mistral-7b, higher topic-related keyword co-occurrence corresponds to more diverse cultural values ($\tau=0.35$).
Figure 5: Overlap in "music instrument". In general, mistral-7b's culture-conditioned generations have a higher overlap rate with culture-agnostic generations. For both gpt-4 and mistral-7b, West European, English Speaking and Nordic countries have the highest overlap rate.
...and 5 more figures
