
Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

Erin MacMurray van Liemt, Aida Davani, Sinchana Kumbale, Neha Dixit, Sunipa Dev

Abstract

Cultural representation in Large Language Model (LLM) outputs has primarily been evaluated through the proxies of cultural diversity and factual accuracy. However, a crucial gap remains in assessing cultural alignment: the degree to which generated content mirrors how native populations perceive and prioritize their own cultural facets. In this paper, we introduce a human-centered framework to evaluate the alignment of LLM generations with local expectations. First, we establish a human-derived ground-truth baseline of importance vectors, called Cultural Importance Vectors, based on an induced set of culturally significant facets from open-ended survey responses collected across nine countries. Next, we introduce a method to compute model-derived Cultural Representation Vectors of an LLM based on a syntactically diversified prompt set and apply it to three frontier LLMs (Gemini 2.5 Pro, GPT-4o, and Claude 3.5 Haiku). Our investigation of the alignment between the human-derived Cultural Importance Vectors and model-derived Cultural Representation Vectors reveals a Western-centric calibration for some of the models, where alignment decreases as a country's cultural distance from the US increases. Furthermore, we identify highly correlated, systemic error signatures ($\rho > 0.97$) across all models, which over-index on some cultural markers while neglecting the deep-seated social and value-based priorities of users. Our approach moves beyond simple diversity metrics toward evaluating the fidelity of AI-generated content in authentically capturing the nuanced hierarchies of global cultures.
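The core alignment measures the abstract refers to can be illustrated with a small sketch. This is not the paper's code or data: the facet names and all numbers below are invented, and the paper's actual vectors are derived from surveys and model outputs.

```python
import numpy as np

# Hypothetical facet axes and vectors for one country; values are
# illustrative only, not taken from the paper's survey or model data.
facets = ["food", "clothing", "events", "values", "language"]
human_importance = np.array([0.30, 0.10, 0.15, 0.35, 0.10])  # Cultural Importance Vector
model_repr = np.array([0.40, 0.20, 0.20, 0.10, 0.10])        # Cultural Representation Vector

# Pearson correlation: do the two vectors prioritize facets similarly?
pearson_r = np.corrcoef(human_importance, model_repr)[0, 1]

# Cosine similarity: how close are the two vectors in direction?
cos_sim = float(human_importance @ model_repr) / (
    np.linalg.norm(human_importance) * np.linalg.norm(model_repr)
)

print(round(pearson_r, 3), round(cos_sim, 3))
```

In this made-up example the model over-weights food and clothing while under-weighting values, so both alignment scores fall well short of 1.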

Paper Structure

This paper contains 28 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: The plots represent the relationship between each country's Cultural Fixation Index (CFST), a measure of cultural distance from the US, and the alignment between Cultural Importance Vectors and each LLM's Cultural Representation Vector (Correlation on the top and Cosine Similarity on the bottom). While there is no significant trend for Gemini, both GPT and Claude show negative trends: for countries with higher CFST (less culturally similar to the US), the model representations align less with human preferences.
  • Figure 2: Mean Squared Error (MSE) between Cultural Representation Vectors and Cultural Importance Vectors. Each colored point represents the MSE for a specific model (Gemini, GPT, Claude) within a given cultural facet. Black circles with vertical bars indicate the mean MSE and the standard deviation. Lower MSE values signify a better alignment between the model's facet representation and the GSC.
  • Figure 3: Cultural Profile Comparison (GSC vs. LLMs) for three sample countries. The radar chart on the top represents the Human Baseline (GSC) derived from cultural surveys. Radar charts on the right represent responses from Gemini (purple), GPT-4o (pink), and Claude (blue). All axes share the same maximum value within each radar chart, allowing direct visual comparison of the cultural "shapes".
  • Figure 4: Inter-Model Error Consistency across Countries and Facets. Bar plots illustrate the mean pairwise Pearson correlation between the error vectors of Gemini, GPT-4o, and Claude. Left: Consistency by Country shows similar model misrepresentations for all countries. Right: Consistency by Facet identifies cultural categories (e.g., Event) where models tend to make the same "mistakes" regardless of the country. The red dashed lines indicate the overall average consistency ($r = 0.97$ for countries and $0.82$ for facets).