Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions
Reem I. Masoud, Ziquan Liu, Martin Ferianc, Philip Treleaven, Miguel Rodrigues
TL;DR
The paper tackles cultural misalignment in LLMs by introducing Hofstede's Cultural Alignment Test (Hofstede's CAT), based on Hofstede's VSM13 and its six cultural dimensions: $PDI$, $IDV$, $MAS$, $UAI$, $LTO$, and $IVR$. It applies multi-language prompting (English, Arabic, Chinese), seed-based response collection, and four prompting regimes to quantify cultural alignment, and uses Kendall's tau to compare LLM-derived rankings against ground-truth Hofstede data for the United States, China, and Arab countries. Key findings show that GPT-4 exhibits stronger cultural sensitivity than the other models evaluated, especially in Chinese contexts, but struggles with American and Arab contexts, while language-specific fine-tuning and hyperparameters significantly shift cultural responses. The results highlight the need for culturally diverse training data and alignment strategies to achieve globally acceptable AI, and they offer a practical diagnostic framework for evaluating cultural alignment in LLMs. These insights support more responsible deployment of AI technologies across culturally diverse populations and guide future work at the intersection of AI, anthropology, and cross-cultural studies.
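To make the ranking comparison concrete, the following is a minimal sketch of a Kendall's tau computation between a reference ordering of regions and a model-derived ordering for a single dimension. It is not the authors' implementation: the function name `kendall_tau`, the tau-a variant (tied pairs counted as neither concordant nor discordant), and all numeric scores are assumptions chosen for illustration, and none of the values are taken from the paper or from Hofstede's data.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical placeholder scores for one dimension (NOT real data).
regions = ["US", "China", "Arab countries"]
reference_scores = [3.0, 1.0, 2.0]  # stand-in for ground-truth values
model_scores = [2.5, 0.5, 3.5]      # stand-in for LLM-derived values

tau = kendall_tau(reference_scores, model_scores)
print(f"Kendall tau across {len(regions)} regions: {tau:.3f}")
```

A value of 1 means the model orders the regions exactly as the reference does, and -1 means the order is fully reversed. With only three regions, tau can take just a few distinct values, so a single score of this kind is a coarse signal.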
Abstract
The deployment of large language models (LLMs) raises concerns regarding their cultural misalignment and its potential ramifications for individuals and societies with diverse cultural backgrounds. While the discourse has focused mainly on political and social biases, our research proposes a Cultural Alignment Test (Hofstede's CAT) to quantify cultural alignment using Hofstede's cultural dimension framework, which offers an explanatory cross-cultural comparison through latent variable analysis. We apply our approach to quantitatively evaluate LLMs, namely Llama 2, GPT-3.5, and GPT-4, against the cultural dimensions of regions such as the United States, China, and Arab countries, using different prompting styles and exploring the effects of language-specific fine-tuning on the models' behavioural tendencies and cultural values. Our results quantify the cultural alignment of LLMs and reveal the differences between LLMs along explanatory cultural dimensions. Our study demonstrates that while all LLMs struggle to grasp cultural values, GPT-4 shows a unique capability to adapt to cultural nuances, particularly in Chinese settings. However, it faces challenges with American and Arab cultures. The research also highlights that fine-tuning Llama 2 models with different languages changes their responses to cultural questions, emphasizing the need for culturally diverse development in AI for worldwide acceptance and ethical use. For more details or to contribute to this research, visit our GitHub page: https://github.com/reemim/Hofstedes_CAT/
