Language Shapes Mental Health Evaluations in Large Language Models

Jiayi Xu, Xiyang Hu

Abstract

This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.

Paper Structure (25 sections, 3 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Large language models express systematically higher perceived public stigma and personal stigma when prompted in Chinese than in English. (a) Perceived public stigma, measured by the Devaluation–Discrimination Scale (DDS) [link1987understanding]; (b) personal stigma toward individuals with mental illness, measured by the Mental Illness Stigma Scale (MISS) [day2007measuring]; and (c) scenario-based personal stigma, assessed using a vignette-based measure [pescosolido2021trends]. Across dimensions, both GPT-4o and Qwen3 expressed higher stigma scores when prompted in Chinese (red) relative to English (blue). Bars represent mean scores and error bars denote 95% confidence intervals. Statistical comparisons between language conditions were conducted using two-sided Welch's t-tests (*P < 0.05; **P < 0.01; ***P < 0.001).
  • Figure 2: Large language models express systematically higher depression-specific social stigma, self-stigma, and professional stigma when prompted in Chinese than in English. (a) Depression-specific perceived stigma and (b) depression-specific personal stigma, both measured by the Depression Stigma Scale (DSS) [santomauro2021global]; (c) self-stigma, measured by the Self-Stigma of Seeking Help Scale (SSOSH) [vogel2006measuring]; and (d) professional stigma, measured by the Opening Minds Scale for Health Care Providers (OMS-HC) [modgill2014opening]. Across dimensions, both GPT-4o and Qwen3-32b generated higher stigma scores when prompted in Chinese (red) relative to English (blue). Bars represent mean scores and error bars denote 95% confidence intervals. Statistical comparisons between language conditions were conducted using two-sided Welch's t-tests (*P < 0.05; **P < 0.01; ***P < 0.001).
  • Figure 3: Chinese prompts lead to systematic under-estimation in depression severity detection. Bars represent paired discordant cases derived from McNemar comparisons between English and Chinese prompts. Values to the left of zero indicate under-estimation relative to the gold severity label, whereas values to the right indicate over-estimation. Across both models, Chinese prompts show substantially more under-estimation cases than English prompts, indicating a systematic downward shift in predicted severity. In contrast, over-estimation differences are smaller and model-dependent. These results suggest that cross-linguistic differences manifest primarily as directional calibration shifts rather than random performance fluctuations.
  • Figure 4: Chinese prompts induce a systematic downward shift in predicted severity relative to English prompts. Severity levels were coded as ordinal values (Minimal = 0, Mild = 1, Moderate = 2, Severe = 3) when computing prediction error. Mean prediction error (predicted severity minus true severity) is shown for English (blue) and Chinese (red) prompts at each true severity level. Positive values indicate overestimation relative to the gold label, whereas negative values indicate underestimation. Across both models, English prompts show greater overestimation for minimal and mild cases and smaller underestimation for moderate and severe cases. Error bars denote 95% confidence intervals.
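
The quantities described in the captions of Figures 3 and 4 can be illustrated with a minimal sketch. It uses the ordinal coding stated in the Figure 4 caption (Minimal = 0, Mild = 1, Moderate = 2, Severe = 3) to compute the signed prediction error, then counts paired discordant under-estimation cases between the two prompt languages. The function names, the toy labels, and the choice of an exact binomial form of McNemar's test are assumptions made for illustration; the captions do not specify which McNemar variant the authors used, and the data below are not from the paper.

```python
from math import comb
from statistics import mean

# Ordinal coding as stated in the Figure 4 caption.
SEVERITY = {"Minimal": 0, "Mild": 1, "Moderate": 2, "Severe": 3}


def signed_error(predicted, gold):
    """Predicted minus true severity: negative means under-estimation."""
    return SEVERITY[predicted] - SEVERITY[gold]


def mean_error_by_level(predictions, gold_labels):
    """Mean signed error grouped by the true (gold) severity level."""
    by_level = {}
    for pred, gold in zip(predictions, gold_labels):
        by_level.setdefault(gold, []).append(signed_error(pred, gold))
    return {level: mean(errors) for level, errors in by_level.items()}


def discordant_underestimation(pred_en, pred_zh, gold_labels):
    """Count paired cases where exactly one prompt language under-estimates.

    Returns (only_english_under, only_chinese_under).
    """
    only_en = only_zh = 0
    for en, zh, gold in zip(pred_en, pred_zh, gold_labels):
        en_under = signed_error(en, gold) < 0
        zh_under = signed_error(zh, gold) < 0
        if en_under and not zh_under:
            only_en += 1
        elif zh_under and not en_under:
            only_zh += 1
    return only_en, only_zh


def mcnemar_exact_p(b, c):
    """Two-sided exact (binomial) McNemar p-value for discordant counts b, c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy labels, for illustration only (not data from the paper).
gold = ["Minimal", "Mild", "Moderate", "Severe", "Severe", "Moderate"]
pred_en = ["Mild", "Mild", "Moderate", "Severe", "Moderate", "Moderate"]
pred_zh = ["Minimal", "Minimal", "Mild", "Moderate", "Moderate", "Moderate"]

errors_en = mean_error_by_level(pred_en, gold)
errors_zh = mean_error_by_level(pred_zh, gold)
only_en, only_zh = discordant_underestimation(pred_en, pred_zh, gold)
p_value = mcnemar_exact_p(only_en, only_zh)
```

On the toy labels, the Chinese-prompt predictions have a more negative mean error at every non-minimal severity level, and all discordant under-estimation cases fall on the Chinese side, which is the directional pattern the captions describe. With so few pairs the exact test is not significant; the sketch only shows how the counts and errors are formed.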