Exploring Large Language Models on Cross-Cultural Values in Connection with Training Methodology

Minsang Kim, Seungjun Baek

TL;DR

This work investigates how open-source LLMs judge cultural values across countries and how training choices shape these judgments. It introduces a probing framework based on the World Values Survey (WVS) that converts survey items into multiple-choice prompts, scores each LLM's answer to an item by the mean score $\sum_{k=1}^{K} \hat{p}_k s_k$ (with $\hat{p}_k$ the model's probability for answer option $k$ and $s_k$ that option's score), and measures agreement with human responses using Pearson's correlation $r$. Key findings: LLMs closely mirror humans on Socio-Cultural Norms but lag on Social Systems and Progress; they exhibit a Western bias that multilingual pretraining mitigates; larger models show stronger cultural awareness; synthetic data can transfer cultural knowledge to smaller models; and alignment makes judgments more human-like. These insights inform design choices for culturally aware AI, indicating when and how LLMs can be deployed responsibly in culturally diverse contexts.
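As a rough illustration of the two quantities named above, the sketch below computes a model's mean score for one item as $\sum_{k=1}^{K} \hat{p}_k s_k$ and Pearson's $r$ between human and model mean scores. It assumes the option probabilities $\hat{p}_k$ have already been extracted from the model (how they are obtained is not described here) and renormalizes them to sum to one; the function names and toy numbers are hypothetical and are not the paper's code or data.

```python
import math


def expected_score(probs, scores):
    """Mean score sum_k p_k * s_k for one multiple-choice survey item.

    probs:  model probabilities for the K answer options (assumption: they are
            renormalized here in case they do not sum exactly to 1).
    scores: numeric value s_k of each answer option on the survey scale.
    """
    if len(probs) != len(scores):
        raise ValueError("probs and scores must both have length K")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have a positive sum")
    return sum((p / total) * s for p, s in zip(probs, scores))


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: four items on a 4-point scale (s_k = 1..4).
scale = [1, 2, 3, 4]
llm_probs = [
    [0.1, 0.2, 0.3, 0.4],
    [0.7, 0.2, 0.1, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.1, 0.2, 0.7],
]
human_means = [3.1, 1.6, 2.4, 3.5]  # made-up human mean ratings per item

llm_means = [expected_score(p, scale) for p in llm_probs]
r = pearson_r(human_means, llm_means)
print([round(m, 2) for m in llm_means], round(r, 3))
```

Whether the correlation is taken across items within a country or across countries for each item is not specified in this summary; the sketch only shows the arithmetic of the two formulas.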

Abstract

Large language models (LLMs) closely interact with humans and thus need an intimate understanding of the cultural values of human society. In this paper, we explore how open-source LLMs make judgments on diverse categories of cultural values across countries, and how these judgments relate to training methodology such as model size, training corpus, and alignment. Our analysis shows that LLMs judge socio-cultural norms similarly to humans, but less so social systems and progress. In addition, LLMs tend to judge cultural values with a bias toward Western culture, which can be mitigated by training on multilingual corpora. We also find that increasing model size improves the understanding of social values, while smaller models can be enhanced using synthetic data. Our analysis offers insights into the design methodology of LLMs in connection with their understanding of cultural values.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Comparison of human ratings and LLM predictions across 55 countries in the WVS (Haerpfer2022). Left: boxplots of human ratings per category across countries. Right: the corresponding cultural judgment scores estimated by Llama-2-70b Chat (touvron2023llama).
  • Figure 2: Average Pearson's correlation of smaller models (up to 8B parameters) across all countries, grouped by continent.
  • Figure 3: Average Pearson's correlation of larger models (13B, 34B, and 70B) across all countries, grouped by continent.
  • Figure 4: Comparison of average Pearson's correlation between chat and non-chat models.
  • Figure 5: Prompt template for WVS questions. For chat models, we apply each model's own chat template.
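Figures 2 and 3 report correlations averaged over the countries of each continent. A minimal sketch of that aggregation, assuming one Pearson's $r$ per country has already been computed and that a plain arithmetic mean is used (the summary does not say; averaging through Fisher's z-transform is a common alternative). The helper name, country labels, and values are placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def average_by_continent(country_r, country_to_continent):
    """Average per-country correlations within each continent.

    country_r:            {country: Pearson's r between model and human scores}
    country_to_continent: {country: continent label}
    """
    groups = defaultdict(list)
    for country, r in country_r.items():
        groups[country_to_continent[country]].append(r)
    # Plain arithmetic mean of r per continent (an assumption, see lead-in).
    return {continent: mean(rs) for continent, rs in groups.items()}


# Placeholder countries and made-up correlations, not results from the paper.
country_r = {"Country A": 0.8, "Country B": 0.6, "Country C": 0.5, "Country D": 0.3}
continent_of = {
    "Country A": "Europe",
    "Country B": "Europe",
    "Country C": "Asia",
    "Country D": "Asia",
}
by_continent = average_by_continent(country_r, continent_of)
print(by_continent)
```

Keying the groups on model type instead of continent would give the same kind of chat versus non-chat averages that Figure 4 compares.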