The fragility of "cultural tendencies" in LLMs

Kun Sun, Rong Wang

Abstract

In a recent study, Lu, Song, and Zhang (2025) (LSZ) propose that large language models (LLMs), when prompted in different languages, display culturally specific tendencies. They report that the two models they tested (GPT and ERNIE) respond in more interdependent and holistic ways when prompted in Chinese, and in more independent and analytic ways when prompted in English. LSZ attribute these differences to deep-seated cultural patterns in the models, claiming that prompt language alone can induce substantial cultural shifts. While we acknowledge the empirical patterns they observed, we find their experiments, methods, and interpretations problematic. In this paper, we critically re-evaluate the methodology, theoretical framing, and conclusions of LSZ. We argue that the reported "cultural tendencies" are not stable traits but fragile artifacts of specific models and task design. To test this, we conducted targeted replications using a broader set of LLMs and a larger number of test items. Our results show that prompt language has minimal effect on outputs, challenging LSZ's claim that these models encode grounded cultural beliefs.

Paper Structure

This paper contains 16 sections, 1 equation, 1 figure, 10 tables.

Figures (1)

  • Figure 1: Results across experiments, visualized in three panels. (A) Effect sizes (Cohen's $d$) for each comparison. Positive values indicate higher scores for the first condition; negative values indicate higher scores for the second condition. Green bars denote statistically significant differences ($p < 0.05$), while grey bars denote non-significant results. (B) Heatmap of statistical significance expressed as $-\log_{10}(p)$. Darker brown corresponds to smaller $p$-values (greater significance). (C) Cross-language consistency, measured by the Phi coefficient ($\phi$), for the models with significant results across all experiments. Values close to 1 indicate high agreement between choices under different prompt languages.
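
The three quantities named in the Figure 1 caption can be illustrated with a minimal Python sketch. It assumes the pooled-standard-deviation form of Cohen's $d$ and the standard $2 \times 2$ form of $\phi$ for paired binary choices; the caption does not say which variants the paper uses, and the function names and toy data below are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(first, second):
    """Cohen's d with a pooled SD (assumed variant); positive when `first` scores higher."""
    n1, n2 = len(first), len(second)
    pooled_var = ((n1 - 1) * variance(first) + (n2 - 1) * variance(second)) / (n1 + n2 - 2)
    return (mean(first) - mean(second)) / math.sqrt(pooled_var)


def neg_log10_p(p):
    """Significance on the scale of panel (B): smaller p gives a larger value."""
    return -math.log10(p)


def phi_coefficient(choices_a, choices_b):
    """Phi coefficient for two paired sequences of binary (0/1) choices."""
    pairs = list(zip(choices_a, choices_b))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    n10 = sum(1 for a, b in pairs if a == 1 and b == 0)
    n01 = sum(1 for a, b in pairs if a == 0 and b == 1)
    n00 = sum(1 for a, b in pairs if a == 0 and b == 0)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when one sequence never varies
    return (n11 * n00 - n10 * n01) / denom


# Toy data (hypothetical, not taken from the paper).
scores_first = [2, 4, 6]
scores_second = [1, 3, 5]
d_value = cohens_d(scores_first, scores_second)

# Identical choices under both prompt languages give phi = 1.
phi_same = phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0])
# Unrelated choices give phi = 0.
phi_unrelated = phi_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

Under these assumptions, a $\phi$ near 1 means a model picks the same option for an item whichever language it is prompted in, which is the reading the caption gives for panel (C).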