Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

Ariba Khan, Stephen Casper, Dylan Hadfield-Menell

Abstract

Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches evaluate cultural alignment through survey-based assessments borrowed from social science methodologies, but they often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated and held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can make the results of an evaluation highly sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results reveal significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and underscore the need for systematic robustness checks and red-teaming of evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively.

Paper Structure

This paper contains 24 sections, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Core assumptions about LLM cultural alignment fail under systematic evaluation. Our experiments reveal that cultural alignment in LLMs is: (1) not stable (\ref{sec:stability}, \ref{fig:combined_stability_binary}, \ref{fig:combined_stability_multi}) - response variations from trivial format changes often exceed real-world cultural differences; (2) not extrapolable (\ref{sec:extrapolation}, \ref{fig:dimension_disagreement_impact_bar_plot}) - extrapolation from limited dimensions produces near-random clustering results, with strong sensitivity to which dimensions are included; and (3) not steerable (\ref{sec:steerability}, \ref{fig:steerability}) - even optimized prompting techniques produce erratic, un-humanlike response patterns that fail to align with cultural perspectives.
  • Figure 2: LLMs' expressed preferences vary greatly under non-semantic changes to question presentation. (Left) Normalized Category Shift Size shows proportion of maximum possible shift when changing Direction Format (blue, ascending vs. descending) or Response Format (orange, identifier-only vs. option-text). (Right) Effect Size (Weighted Mean Difference) measures response magnitude changes between format conditions. Red dashed line represents one standard deviation (0.114) of between-country human response variation. The overall change in assessed preference often exceeds one human standard deviation. Hypothesis testing: $*$/$**$/$***$ = $p<$0.05/0.01/0.001. (Left) Chi-square test against the null hypothesis that response categories are independent of format changes. (Right) One-sided permutation test with 10,000 iterations against the null hypothesis that shifts in model outputs between presentation conditions are due to random chance.
  • Figure 3: Binary variations in evaluation design impact LLMs' expressed cultural preferences. (Left) LLM cover letter evaluations vary under comparative versus absolute preference elicitation. Normalized preferences (-1 to +1) for comparative (blue) and absolute (orange) ratings reveal differences in distributions across cultural dimensions. (Right) Reasoning requirements alter rating distributions. Rating patterns with reasoning (blue) and without reasoning (orange) show varying distributions across the same dimensions. Hypothesis testing: $*$/$**$/$***$ = $p<$0.05/0.01/0.001 according to one-sided permutation tests with 10,000 iterations against the null hypothesis of no difference in mean ratings between conditions.
  • Figure 4: Multi-category variations in evaluation design affect LLMs' cultural preferences. (Left) The Likert scale size affects LLMs' implicit cultural preferences. Response patterns across 4-point (blue), 5-point (orange), and 6-point (green) scales show differences in preference distributions across cultural dimensions. (Right) Trivial changes in the role that an LLM is prompted to play can influence expressed preferences. Normalized ratings from Hiring Manager (blue), Job Applicant (orange), and Career Coach (green) perspectives show systematic variations. Hypothesis testing: $*$/$**$/$***$ = $p<$0.05/0.01/0.001 according to one-sided permutation tests with 10,000 iterations against the null hypothesis that there is no difference in mean ratings between conditions.
  • Figure 5: Extrapolation across cultural dimensions is unreliable in humans and LLMs alike. The validity of extrapolation is highly sensitive to the geometry of individual cultural dimensions. (Left) For humans and LLMs, extrapolability (as measured by clustering ARI) increases with the number of observed dimensions. However, it is near a random-guess baseline for low numbers of observed dimensions. (Right) Different cultural dimensions have very different impacts on extrapolation between dimensions. For humans (blue), Indulgence/Restraint strengthens groupings while Masculinity/Femininity weakens them. For LLMs (orange), Uncertainty Avoidance Index strengthens the clustering while Long/Short Term Orientation weakens it. Hypothesis testing: $*$/$**$/$***$ = $p<$0.05/0.01/0.001. (Left) For each group (Countries/LLM), the null hypothesis was that the clustering similarity between subsets of dimensions versus all dimensions could arise from random cluster assignments. (Right) For each group (Countries/LLM), the null hypothesis was that there is no difference in mean ARI scores between dimension subsets that include vs. exclude each dimension.
  • ...and 2 more figures
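
Several of the captions above report one-sided permutation tests with 10,000 iterations against a null hypothesis of no difference in mean ratings between conditions. A minimal sketch of such a test might look as follows. It is not the authors' implementation: the function name, the use of the raw difference in means as the test statistic, the direction of the alternative, the add-one correction, and the example ratings are all assumptions made for illustration.

```python
import random


def one_sided_permutation_test(a, b, n_iter=10_000, seed=0):
    """Return (observed mean difference, estimated one-sided p-value).

    Null hypothesis: condition labels are exchangeable, so mean(a) - mean(b)
    is no larger than what random relabeling would produce.
    """
    rng = random.Random(seed)
    n_a = len(a)
    observed = sum(a) / n_a - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomly reassign ratings to the two conditions
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        diff = sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b)
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_iter + 1)


# Hypothetical ratings under two evaluation conditions (not data from the paper).
with_reasoning = [3, 4, 4, 5, 4, 5, 3, 4]
without_reasoning = [2, 3, 2, 3, 3, 2, 4, 3]
diff, p_value = one_sided_permutation_test(with_reasoning, without_reasoning)
print(f"mean difference = {diff:.3f}, p = {p_value:.4f}")
```

The p-value is the fraction of random relabelings whose mean difference is at least as large as the observed one, so a small value suggests the gap between conditions is unlikely to come from chance assignment alone.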