Cultural Value Differences of LLMs: Prompt, Language, and Model Size
Qishuai Zhong, Yike Yun, Aixin Sun
TL;DR
This work investigates how large language models express cultural values across three axes: prompt content within a single language, prompt language, and model size. Using Hofstede's Value Survey Module (VSM) 2013 across six families of LLMs and 54 simulated identities, the study quantifies cultural-value representations through the six Hofstede dimensions and derives inter- and intra-set disparity metrics, including the novel Silhouette Score with Human Reference ($SS_h$) and the Model Cultural Disparity ($MCD$). The findings show that cultural values tend to be consistent within a single language but vary markedly across languages, with language-induced differences sometimes exceeding prompt-based perturbations; model size also significantly influences the expressed values, more so than model family differences. These results highlight language as the primary driver of cross-cultural value expressions in LLMs and suggest important implications for cross-lingual auditing, responsible deployment, and alignment research, while outlining future work to connect value patterns with generation quality and broader cultural surveys.
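To make the six-dimension scoring concrete, the sketch below computes Hofstede index scores from VSM 2013 item means. It is illustrative only and is not the authors' code. The weights and item pairings follow our reading of the published VSM 2013 manual and should be checked against it. The names `VSM13_FORMULAS`, `item_means`, and `dimension_scores` are hypothetical, and the per-dimension constants (the manual's C terms) are assumed to default to 0.

```python
from statistics import mean

# Each dimension is a weighted sum of differences between item means.
# Tuples are (weight, added_item, subtracted_item), where items are
# question numbers 1..24. Assumption: values follow the VSM 2013
# manual as we understand it; verify before relying on them.
VSM13_FORMULAS = {
    "PDI": [(35, 7, 2), (25, 20, 23)],    # power distance
    "IDV": [(35, 4, 1), (35, 9, 6)],      # individualism
    "MAS": [(35, 5, 3), (35, 8, 10)],     # masculinity
    "UAI": [(40, 18, 15), (25, 21, 24)],  # uncertainty avoidance
    "LTO": [(40, 13, 14), (25, 19, 22)],  # long-term orientation
    "IVR": [(35, 12, 11), (40, 17, 16)],  # indulgence vs. restraint
}


def item_means(responses):
    """Average each of the 24 items over all respondents.

    `responses` is a list of dicts mapping question number (1..24)
    to an answer on the 1..5 scale, e.g. one dict per simulated identity.
    """
    return {q: mean(r[q] for r in responses) for q in range(1, 25)}


def dimension_scores(responses, constants=None):
    """Return a dict of the six index scores for one response set.

    `constants` optionally maps a dimension name to its additive
    constant; missing entries are treated as 0 (an assumption here).
    """
    constants = constants or {}
    m = item_means(responses)
    return {
        dim: sum(w * (m[a] - m[b]) for w, a, b in terms) + constants.get(dim, 0)
        for dim, terms in VSM13_FORMULAS.items()
    }
```

In this sketch, each set of model responses (for example, one prompt language or one model size) would be scored separately, and the resulting six-dimensional vectors compared across sets.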
Abstract
Our study aims to identify behavioral patterns in the cultural values exhibited by large language models (LLMs). The factors studied are question ordering, prompting language, and model size. Our experiments reveal that each tested LLM can exhibit different cultural values. More interestingly: (i) LLMs exhibit relatively consistent cultural values when presented with prompts in a single language. (ii) The prompting language (e.g., Chinese or English) can influence the expression of cultural values: the same question can elicit divergent cultural values when the same LLM is queried in a different language. (iii) Differences in size within the same model family (e.g., Llama2-7B vs. 13B vs. 70B) have a more significant impact on the demonstrated cultural values than differences between model families (e.g., Llama2 vs. Mixtral). Our experiments reveal that query language and model size are the main factors behind differences in cultural values.
