Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions

Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, Miguel Rodrigues

TL;DR

The paper tackles cultural misalignment in LLMs by introducing Hofstede's Cultural Alignment Test (Hofstede's CAT), based on Hofstede's VSM13 and its six cultural dimensions: $PDI$, $IDV$, $MAS$, $UAI$, $LTO$, and $IVR$. It applies multi-language prompting (English, Arabic, Chinese), seed-based response collection, and four prompting regimes to quantify cultural alignment, and uses Kendall's Tau to compare LLM-derived rankings against ground-truth Hofstede data across the US, China, and Arab regions. Key findings show that GPT-4 exhibits stronger cultural sensitivity, especially in Chinese contexts, but struggles with American and Arab contexts, while language-specific fine-tuning and hyperparameters significantly shift cultural responses. The results highlight the need for culturally diverse training data and alignment strategies to achieve globally acceptable AI, and they offer a practical diagnostic framework for evaluating cultural alignment in LLMs. These insights support more responsible deployment of AI technologies across diverse cultural populations and guide future work at the intersection of AI, anthropology, and cross-cultural studies.
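The rank comparison mentioned above can be illustrated with a minimal sketch. It assumes the tau-a variant of Kendall's Tau (tied pairs count as neither concordant nor discordant) and ranks three regions on a single dimension; the summary does not say which variant the authors use or exactly what is ranked. The function name `kendall_tau` and all scores are hypothetical placeholders, not values from the paper or from Hofstede's data.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length lists of scores.

    Assumption: tied pairs add to neither count, which may differ
    from the variant used in the paper.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores for one cultural dimension.
# These are NOT real Hofstede values or model outputs.
reference = {"US": 1.0, "China": 3.0, "Arab": 2.0}
model = {"US": 1.5, "China": 2.0, "Arab": 2.5}

regions = sorted(reference)
tau = kendall_tau(
    [reference[r] for r in regions],
    [model[r] for r in regions],
)
print(f"Kendall's tau = {tau:.3f}")
```

A value of 1 means the model orders the regions exactly as the reference does, and -1 means the order is fully reversed. With only three regions the statistic can take very few distinct values, so it is best read as a coarse indicator.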

Abstract

The deployment of large language models (LLMs) raises concerns regarding their cultural misalignment and potential ramifications for individuals and societies with diverse cultural backgrounds. While the discourse has focused mainly on political and social biases, our research proposes a Cultural Alignment Test (Hofstede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework, which offers an explanatory cross-cultural comparison through latent variable analysis. We apply our approach to quantitatively evaluate LLMs, namely Llama 2, GPT-3.5, and GPT-4, against the cultural dimensions of regions like the United States, China, and Arab countries, using different prompting styles and exploring the effects of language-specific fine-tuning on the models' behavioural tendencies and cultural values. Our results quantify the cultural alignment of LLMs and reveal the differences between LLMs in explanatory cultural dimensions. Our study demonstrates that while all LLMs struggle to grasp cultural values, GPT-4 shows a unique capability to adapt to cultural nuances, particularly in Chinese settings. However, it faces challenges with American and Arab cultures. The research also highlights that fine-tuning Llama 2 models with different languages changes their responses to cultural questions, emphasizing the need for culturally diverse development in AI for worldwide acceptance and ethical use. For more details or to contribute to this research, visit our GitHub page: https://github.com/reemim/Hofstedes_CAT/

Paper Structure (21 sections, 2 equations, 5 figures, 17 tables)


Figures (5)

  • Figure 1: Our framework, Hofstede's Cultural Alignment Test (Hofstede's CAT) for LLMs, detailing the VSM13 questionnaire, the LLM prompts, the instructing LLMs, and the resulting cultural dimensions derived from the LLM's responses.
  • Figure 2: Display of real-world VSM13 scores and normalized scores from models GPT-3.5, GPT-4, and Llama 2 for the countries in focus.
  • Figure 3: The changes in cultural dimensions upon changing the temperature and top-$p$ settings in GPT-3.5.
  • Figure 4: An example from the actual VSM13 questions with its corresponding adjusted prompt and generated response by GPT-3.5.
  • Figure 5: a) Real-world VSM13 scores for the mentioned countries. Normalized scores were generated by b) GPT-3.5 and c) GPT-4 in English, Chinese, and Arabic.