CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting

Huihan Li, Liwei Jiang, Jena D. Hwang, Hyunwoo Kim, Sebastin Santy, Taylor Sorensen, Bill Yuchen Lin, Nouha Dziri, Xiang Ren, Yejin Choi

TL;DR

We find that culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures, and that LLMs show an uneven degree of diversity in the culture symbols.

Abstract

As the use of large language models (LLMs) has proliferated worldwide, it is crucial for them to have adequate knowledge and fair representation of diverse global cultures. In this work, we uncover the culture perceptions of three SOTA models on 110 countries and regions across 8 culture-related topics through culture-conditioned generations, and extract the symbols from these generations that the LLM associates with each culture. We discover that culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures. We also discover that LLMs have an uneven degree of diversity in the culture symbols, and that cultures from different geographic regions have different presence in LLMs' culture-agnostic generation. Our findings promote further research on the knowledge and fairness of global culture perception in LLMs. Code and data can be found here: https://github.com/huihanlhh/Culture-Gen/

Paper Structure (42 sections, 10 figures, 11 tables)

Figures (10)

  • Figure 1: We construct Culture-Gen, a dataset of generations on 8 culture-related topics for 110 countries and regions, using gpt-4, llama2-13b, and mistral-7b. From the generations, we extract the symbols that each model associates with each culture. Using Culture-Gen, we then examine the generations for culture-distinguishing markers, and evaluate the diversity of cultural symbols and LM preferences for cultural symbols in culture-agnostic generations.
  • Figure 2: Different markedness for each geographic region by mistral-7b. Central-Asia, Middle-East and East-Asia show the highest markedness among all geographic regions.
  • Figure 3: Generations for African and Asian cultures have the most vocabulary markers.
  • Figure 4: Teal: Number of diverse culture symbols. Salmon: culture-topic co-occurrence in RedPajama (axis starts from the top). For llama2-13b, higher topic-related keyword co-occurrence corresponds to less diverse cultural values ($\tau=-0.30$). For mistral-7b, higher topic-related keyword co-occurrence corresponds to more diverse cultural values ($\tau=0.35$).
  • Figure 5: Overlap in "music instrument". In general, mistral-7b's culture-conditioned generations have a higher overlap rate with culture-agnostic generations. For both gpt-4 and mistral-7b, West European, English Speaking and Nordic countries have the highest overlap rate.
  • ...and 5 more figures