Fluent but Foreign: Even Regional LLMs Lack Cultural Alignment

Dhruv Agarwal, Anya Shukla, Sunayana Sitaram, Aditya Vashistha

TL;DR

The paper asks whether regionally trained Indian LLMs reflect Indian culture or merely produce fluent local-language text. It grounds evaluation in Hofstede’s cultural onion across four tasks (Value Orientation, Opinion Alignment, Cultural Knowledge, Cultural Adaptation), adds a 115-person writing study, and compares six Indic models with six global baselines. Across the datasets (WVS, GlobalOpinionQA, CulturalBench, NormAd) and the user study, Indic models do not outperform global models in aligning with Indian values and often show stronger Western bias, even after prompting or regional fine-tuning. Alignment tilt is summarized by the Cultural Alignment Differential, $\text{CAD}_{q,m} = \text{Sim}_q(m, \text{India}) - \text{Sim}_q(m, \text{USA})$, together with a normalized variant (nCAD) that accounts for cross-country differences. The findings argue for thick×wide, community-grounded, untranslated regional corpora paired with population-scale evaluation to build truly sovereign LLMs, with practical implications for HCI, NLP, and AI governance.
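
As a rough illustration of the CAD formula quoted in the TL;DR, the sketch below computes the per-question differential from answer distributions. The similarity function is an assumption made for illustration (one minus the Jensen-Shannon distance between answer distributions); the summary does not state which similarity the authors use, and nCAD is not reproduced here because its formula is not given. All function names and the example distributions are hypothetical.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions.

    Assumed similarity basis for this sketch; not confirmed by the source.
    """
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms of `a`.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    divergence = 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
    return math.sqrt(max(0.0, divergence))


def similarity(p, q):
    """Sim_q in the formula: 1 means identical answer distributions."""
    return 1.0 - js_distance(p, q)


def cad(model_dist, india_dist, usa_dist):
    """CAD_{q,m} = Sim_q(m, India) - Sim_q(m, USA) for one question q."""
    return similarity(model_dist, india_dist) - similarity(model_dist, usa_dist)


# Hypothetical answer distributions over three options for a single question.
india = [0.2, 0.3, 0.5]
usa = [0.6, 0.3, 0.1]
model = [0.5, 0.3, 0.2]

print(cad(model, india, usa))  # negative here: the model sits closer to the US distribution
```

Under this convention a positive value indicates an Indian tilt and a negative value a US tilt, matching the sign convention described in the Figure 2 caption.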

Abstract

Large language models (LLMs) are used worldwide, yet exhibit Western cultural tendencies. Many countries are now building "regional" or "sovereign" LLMs, but it remains unclear whether they reflect local values and practices or merely speak local languages. Using India as a case study, we evaluate six Indic and six global LLMs on two dimensions, values and practices, grounded in nationally representative surveys and community-sourced QA datasets. Across tasks, Indic models do not align better with Indian norms than global models; in fact, a U.S. respondent is a closer proxy for Indian values than any Indic model. We further run a user study with 115 Indian users and find that writing suggestions from both global and Indic LLMs introduce Westernized or exoticized writing. Prompting and regional fine-tuning fail to recover alignment and can even degrade existing knowledge. We attribute this to scarce culturally grounded data, especially for pretraining. We position cultural evaluation as a first-class requirement alongside multilingual benchmarks and offer a reusable, community-grounded methodology. We call for native, community-authored corpora and thick×wide evaluations to build truly sovereign LLMs.

Paper Structure

This paper contains 65 sections, 2 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Inglehart-Welzel (IW) cultural map. Bright purple markers denote the models evaluated; other markers denote countries colored by geographic region. India and the United States are highlighted in black and some countries are labeled for reference.
  • Figure 2: Cultural Alignment Differential (raw and normalized) for all models on GlobalOpinionQA under Default prompting. Negative values represent a US tilt, and positive values represent an Indian tilt. The dotted vertical line at $x=0$ represents no tilt. Significance stars: *** $p<0.001$, ** $p<0.01$, * $p<0.05$ (two-sided). Normality is checked with the Shapiro–Wilk test ($\alpha=0.05$); if normal, we run a one-sample t-test; otherwise a Wilcoxon signed-rank test. Green text shows effect sizes (Cohen's $d$ for t-tests, $r$ for Wilcoxon).
  • Figure 3: Accuracy on India- and US-specific questions from CulturalBench. Error bars represent 95% confidence intervals. The dotted vertical line at $x=0.25$ marks random-chance accuracy (four answer choices). Stars denote pairwise differences that remain significant after Bonferroni correction (***: $p<0.001$, **: $p<0.01$, using a z-test for proportions).
  • Figure 4: Accuracy on NormAd questions about Indian and US social norms, grouped by the amount of contextual help: Country (hard), Value+Country, and Rule-of-Thumb (easiest). The dotted vertical line at $x=0.33$ marks random-chance accuracy (three answer choices). Stars mark differences that remain significant after Bonferroni correction (***: $p<0.001$, **: $p<0.01$, *: $p<0.05$).
  • Figure 5: (a) Inglehart-Welzel Cultural Map 2023 [Inglehart2005]. (b) Euclidean distance between each model and India under three prompting strategies. Lower is better. The dotted vertical line marks the distance between India and the US on the cultural map; most models are farther from India than an average American.
  • ...and 3 more figures
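
The Figure 3 caption mentions a z-test for proportions followed by Bonferroni correction. A minimal sketch of that kind of comparison is given below, assuming a pooled two-proportion z-test with a two-sided p-value; the captions do not specify the exact variant the authors ran, and the function names and counts are hypothetical.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    Assumed variant for illustration; the source only says "z-test for proportions".
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that stay below alpha after dividing by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical counts: one model's accuracy on US-specific vs. India-specific items.
z, p = two_proportion_z_test(correct_a=90, n_a=100, correct_b=50, n_b=100)
print(z, p)
print(bonferroni_significant([0.01, 0.04, 0.2]))
```

Bonferroni is the most conservative common correction; with one comparison per model, the per-test threshold shrinks in proportion to the number of models compared.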