An Evaluation of Cultural Value Alignment in LLM

Nicholas Sukiennik, Chen Gao, Fengli Xu, Yong Li

TL;DR

This paper conducts a large-scale evaluation of cultural value alignment in LLMs across 20 countries and 10 models using Hofstede's Values Survey Module (VSM). It introduces the Deviation Ratio, a metric that accounts for the skew of model outputs toward a global-average culture, and analyzes the effects of prompt language, model origin, and external factors such as GDP and web-content share. Key findings: outputs across all models sit near a moderate global average, the United States is the best-aligned country, and GLM-4 is the best-aligning model; prompt language and data availability strongly influence alignment. The work highlights implications for culturally adaptive alignment, cautions about potential cross-cultural biases, and suggests directions for richer, more globally representative training data. These insights inform practical strategies for producing culturally considerate LLM outputs and identify avenues for future cross-cultural benchmarking.

Abstract

LLMs as intelligent agents are increasingly applied in scenarios that involve human interaction, raising a critical concern about whether LLMs are faithful to the variations in culture across regions. Several works have investigated this question in various ways, finding biases in the cultural representations of LLM outputs. To gain a more comprehensive view, in this work we conduct the first large-scale evaluation of LLM culture, assessing 20 countries' cultures and languages across ten LLMs. Using a renowned cultural values questionnaire and carefully comparing LLM output against human ground-truth scores, we thoroughly study LLMs' cultural alignment across countries and among individual models. Our findings show that the output over all models represents a moderate cultural middle ground. Given the overall skew, we propose an alignment metric, revealing that the United States is the best-aligned country and GLM-4 has the best ability to align to cultural values. Deeper investigation sheds light on the influence of model origin, prompt language, and value dimensions on cultural output. Specifically, models, regardless of where they originate, align better with the US than they do with China. The conclusions provide insight into how LLMs can be better aligned to various cultures, and provoke further discussion of the potential for LLMs to propagate cultural bias and the need for more culturally adaptable models.

Paper Structure

This paper contains 12 sections, 2 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison of ground truth and raw country results, with average ground truth and average of all LLM results.
  • Figure 2: Deviation from global average vs. difference from ground truth.
  • Figure 3: A comparison of model evaluation using two metrics: difference from ground truth (lower = better), and deviation ratio (higher = better). Each model ranking contains results with four prompting methods. The models that differ in rank between the two figures are highlighted in red.
  • Figure 4: Deviation ratio comparison between models of US-origin (a) and China-origin (b), and three forms of prompts: English, Chinese, and the average of all other languages. Scores show alignment to US and China ground truth culture scores.
  • Figure 5: Comparison of US-origin and China-origin models, prompted in English, Chinese, and the average of all other languages. Scores are averaged over all country results to show average alignment w.r.t. model origin and prompt language.
  • ...and 3 more figures