Multilingual != Multicultural: Evaluating Gaps Between Multilingual Capabilities and Cultural Alignment in LLMs

Jonathan Rystrøm, Hannah Rose Kirk, Scott Hale

TL;DR

This work interrogates whether advancing multilingual capabilities in large language models yields better cultural alignment with population value distributions. It introduces a distribution-based evaluation framework anchored to World Values Survey data and uses a multi-language, multi-model comparison (Gemma, OLMo, GPT-turbo) to assess alignment and US-centric bias. Using linear mixed-effects regression and self-consistency controls, the authors show that multilingual improvements do not consistently translate to cultural alignment, and self-consistency emerges as a stronger predictor. The findings underscore the need for targeted alignment strategies beyond general capability scaling to ensure culturally respectful and globally useful LLM behavior.

Abstract

Large Language Models (LLMs) are becoming increasingly capable across global languages. However, the ability to communicate across languages does not necessarily translate to appropriate cultural representations. A key concern is US-centric bias, where LLMs reflect US rather than local cultural values. We propose a novel methodology that compares LLM-generated response distributions against population-level opinion data from the World Values Survey across four languages (Danish, Dutch, English, and Portuguese). Using a rigorous linear mixed-effects regression framework, we compare two families of models: Google's Gemma models (2B to 27B parameters) and successive iterations of OpenAI's turbo-series. Across the families of models, we find no consistent relationships between language capabilities and cultural alignment. While the Gemma models have a positive correlation between language capability and cultural alignment across languages, the OpenAI models do not. Importantly, we find that self-consistency is a stronger predictor of multicultural alignment than multilingual capabilities. Our results demonstrate that achieving meaningful cultural alignment requires dedicated effort beyond improving general language capabilities.

Paper Structure

This paper contains 20 sections, 4 equations, 5 figures.

Figures (5)

  • Figure 1: The relationship between multilingual capability and cultural alignment is inconsistent across LLM families, as shown by coefficients from our linear mixed-effects model ($\beta_{multilingual}=\beta_{flm}$; Eq. \ref{eq:rq1-eq}; §\ref{sec:methods-rq1}). OpenAI and OLMo models show negative or insignificant relationships outside of Danish and Dutch, while Gemma models show positive relationships throughout ($p<.05$).
  • Figure 2: Pearson correlations in value polarity scores across studied countries from the World Values Survey. Value polarity scores are the fraction of the population in favour of a given topic. All correlations are positive, with most falling between 0.7 and 0.95.
  • Figure 3: Self-consistency in responses for LLMs and WVS countries. LLMs have lower self-consistency than resampled WVS responses (shown by the dashed lines), particularly in non-English languages.
  • Figure 4: Language capability (x-axis) vs cultural alignment scores (y-axis) across languages. Stars indicate significance ($p<.05$) in our linear mixed-effects regression of multiple runs (see §\ref{sec:methods-rq1}). OpenAI models (blue) and OLMo models (red) show negative/insignificant relationships outside of English, while the Gemma models (green) show positive relationships throughout ($p<.05$).
  • Figure 5: US-centric bias coefficients across LLMs and languages ($\beta_{BiasUS}$; see Eq. \ref{eq:rq2-eq}). Error bars are standard errors from the regression. Positive values indicate the presence of US-centric bias.
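
The value polarity score defined in the Figure 2 caption (the fraction of a population in favour of a topic) and the Pearson correlation between two countries' scores can be illustrated with a minimal Python sketch. The function names, topic labels, and toy responses below are hypothetical and chosen only for illustration; they are not the authors' code or World Values Survey data.

```python
from math import sqrt


def value_polarity(responses):
    """Fraction of responses that are in favour (True) of a topic."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if r) / len(responses)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: True means the respondent is in favour of the topic.
country_a = {
    "topic_1": [True, True, False, True],
    "topic_2": [False, False, True, False],
    "topic_3": [True, True, True, True],
}
country_b = {
    "topic_1": [True, False, True, False],
    "topic_2": [False, True, False, False],
    "topic_3": [True, True, True, False],
}

# One polarity score per topic, in a shared topic order.
topics = sorted(country_a)
scores_a = [value_polarity(country_a[t]) for t in topics]
scores_b = [value_polarity(country_b[t]) for t in topics]

# Correlation of the two countries' polarity profiles across topics.
r = pearson(scores_a, scores_b)
print(f"Pearson r between country A and B: {r:.3f}")
```

The same comparison could in principle be run with an LLM's stance distribution in place of one country, but how the paper turns model responses into polarity scores is not specified in this summary, so that step is left out here.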