Cultural Value Differences of LLMs: Prompt, Language, and Model Size

Qishuai Zhong, Yike Yun, Aixin Sun

TL;DR

This work investigates how large language models express cultural values across three axes: prompt content within a single language, prompt language, and model size. Using Hofstede's Value Survey Module (VSM) 2013 across six families of LLMs and 54 simulated identities, the study quantifies cultural-value representations through the six Hofstede dimensions and derives inter- and intra-set disparity metrics, including the novel Silhouette Score with Human Reference ($SS_h$) and the Model Cultural Disparity ($MCD$). The findings show that cultural values tend to be consistent within a single language but vary markedly across languages, with language-induced differences sometimes exceeding prompt-based perturbations; model size also significantly influences the expressed values, more so than model family differences. These results highlight language as the primary driver of cross-cultural value expressions in LLMs and suggest important implications for cross-lingual auditing, responsible deployment, and alignment research, while outlining future work to connect value patterns with generation quality and broader cultural surveys.
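To make the idea of separation between response sets concrete, the sketch below computes the standard silhouette score over two groups of 6-dimensional vectors. It is not the paper's $SS_h$ metric, whose definition is not given in this summary. The function name `silhouette_score`, the toy vectors, and the "eng"/"chn" labels are hypothetical and chosen only for illustration.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient (standard definition, not the paper's SS_h)."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance from p to the other points that share its label
        same = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
                if l == lab and j != i]
        if not same:
            scores.append(0.0)  # common convention for singleton groups
            continue
        a = mean(same)
        # b: smallest mean distance from p to the points of any other label
        b = min(
            mean(dist(p, q) for q, l in zip(points, labels) if l == other)
            for other in set(labels) if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical 6-d vectors standing in for Hofstede dimension scores;
# these are made-up numbers, not results from the paper.
eng = [(50, 60, 40, 45, 30, 55), (52, 58, 42, 47, 31, 53), (49, 61, 39, 44, 29, 56)]
chn = [(70, 30, 55, 40, 70, 30), (72, 28, 57, 41, 69, 32), (69, 31, 54, 39, 71, 29)]

points = eng + chn
by_language = ["eng"] * 3 + ["chn"] * 3
mixed = ["x", "y", "x", "y", "x", "y"]  # grouping that ignores language

score_language = silhouette_score(points, by_language)
score_mixed = silhouette_score(points, mixed)
print(f"grouped by language: {score_language:.3f}")
print(f"arbitrary grouping:  {score_mixed:.3f}")
```

A score close to 1 means each set is tight and far from the other set; a score near or below 0 means the grouping does not separate the vectors. In this toy data the language grouping scores much higher than the arbitrary one, which is the kind of pattern the summary describes for cross-language differences.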

Abstract

Our study aims to identify behavior patterns in the cultural values exhibited by large language models (LLMs). The studied variables include question ordering, prompting language, and model size. Our experiments reveal that each tested LLM can effectively exhibit different cultural values. More interestingly: (i) LLMs exhibit relatively consistent cultural values when presented with prompts in a single language. (ii) The prompting language (e.g., Chinese or English) can influence the expression of cultural values: the same question can elicit divergent cultural values when the same LLM is queried in a different language. (iii) Differences in size within the same model family (e.g., Llama2-7B vs. 13B vs. 70B) have a more significant impact on the demonstrated cultural values than differences between model families (e.g., Llama2 vs. Mixtral). Overall, our experiments indicate that the query language and the model size are the main factors behind cultural value differences.

Paper Structure (32 sections, 3 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: The 6-d VSM scores for the different experiment sets of each model, visualized with t-SNE (van der Maaten and Hinton, 2008) to facilitate direct comparison. Results from English queries (denoted as "Eng") are displayed with black circles; results from English with Shuffled Options (denoted as "Eng w. Shuffle") are shown with pink stars; and results from Chinese (denoted as "Chn") are represented by green squares.
  • Figure 2: The three red heatmaps display the $SS_h$ values among models, with darker colors highlighting greater disparities. The green heatmap displays the differences in MMLU scores among models, corresponding to the disparities observed in the adjacent red heatmap.
  • Figure 3: Pipeline of investigations, exploring cultural values alignment in LLMs in three steps. (i) Evaluating cultural values exhibited by an LLM queried by a single language but with variants of prompts. (ii) Assessing cultural values in the context of different languages. (iii) Examining cultural values exhibited by different LLMs, within and across model families and in different model sizes.
  • Figure 4: Prompt samples for the two languages used in the experiment. In both samples, the text highlighted in red is copied from the original question in the questionnaire. The VSM 2013 test contains approximately nine types of questions. All customized components are filled in with their respective values when querying the model.
  • Figure 5: VSM Questionnaire Page 1
  • ...and 2 more figures