Do Chinese models speak Chinese languages?

Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno

TL;DR

This study probes whether Chinese open-source LLMs meaningfully support languages spoken in China beyond Mandarin. It compares six Chinese and four Western models across 21 language variants using Information Parity (IP), Belebele MRC, and MC^2 language identification to test four hypotheses (Null, Mandarin, Pluralist, Regional). Performance across languages correlates strongly between Chinese and Western models, and Chinese models hold a notable advantage only in Mandarin, which supports the Mandarin Hypothesis but does not clearly support the Pluralist or Regional hypotheses. The findings suggest that current open-source Chinese models are largely Mandarin-centric and that their resource allocation across other languages appears to follow the same multilingual data distribution as Western models, highlighting a need for targeted minority-language data and evaluation, as well as policy-aware model development. Overall, the work informs end users about the language strengths and limitations of open-source Chinese LLMs and motivates focused expansion of minority-language resources and benchmarks to reduce linguistic inequities in AI access.

Abstract

The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we test performance of Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show Chinese models' performance across these languages correlates strongly (r=0.93) with Western models', with the sole exception being better Mandarin. Sometimes, Chinese models cannot identify languages spoken by Chinese minorities such as Kazakh and Uyghur, even though they are good at French and German. These results provide a window into current development priorities, suggest options for future development, and indicate guidance for end users.

Paper Structure

This paper contains 22 sections, 2 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Information Parity of Chinese models vs. Western models.
  • Figure 2: Correlation of IP and MRC Accuracy between Chinese and Western instruction-tuned models across languages. Across languages, the two model groups have a Pearson correlation of $0.925$ in IP and $0.991$ in MRC accuracy.
  • Figure 3: MRC Accuracy of Chinese vs. Western models. Chinese models have higher accuracy in reading comprehension questions than Western models in Mandarin. With base models, Chinese models have higher accuracy than Western models in most languages except Burmese, Lao, Jingpho, and Tibetan. But both groups are similar on their instruction-tuned models across all languages. We notice that DeepSeek-R1-Qwen is more competitive with its custom chat template (see Appendix Figure \ref{fig:chat-template-effect-MRC}). Since we apply a consistent prompt across all models for comparability, we exclude DeepSeek-R1-Qwen from the grouped bar results. Instead, we highlight its performance with the more effective chat template using a green triangle in (a).
  • Figure 4: Average IP (vs. English) and MRC Accuracy of each instruction-tuned model for select languages. In both figures, Chinese LLMs are represented by circle markers, and Western LLMs by diamond-plus markers. The vertical line in the MRC figure is the $0.25$ random baseline. Chinese LLMs all have higher IP than Western LLMs in Simplified Mandarin. In MRC accuracy, Gemma2-Instruct is consistently the highest, and DeepSeek models underperform. The order of models stays similar across languages, except in Tibetan, where most models are near random.
  • Figure 5: Change in MRC accuracy with instruction-tuned vs. base models. Instruction-tuned models generally outperform their base versions, especially Llama3. Across most languages, Western instruction-tuned models gain more accuracy over their base models than Chinese instruction-tuned models do.
  • ...and 7 more figures
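
The Figure 2 caption reports a Pearson correlation across languages between the two model groups. The sketch below shows one way such a number could be computed: average each group's per-language scores, then correlate the two averaged vectors. It is a minimal illustration, not the authors' code. The helper names (`pearson`, `group_mean_by_language`), the model and language labels, and all scores are hypothetical placeholders, and the paper may aggregate differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def group_mean_by_language(scores_by_model):
    """Average per-language scores over all models in one group."""
    languages = sorted(next(iter(scores_by_model.values())))
    n_models = len(scores_by_model)
    return {
        lang: sum(model[lang] for model in scores_by_model.values()) / n_models
        for lang in languages
    }


# Hypothetical placeholder scores (NOT values from the paper).
chinese_group = {
    "model_a": {"lang_1": 0.90, "lang_2": 0.62, "lang_3": 0.31},
    "model_b": {"lang_1": 0.86, "lang_2": 0.58, "lang_3": 0.27},
}
western_group = {
    "model_c": {"lang_1": 0.80, "lang_2": 0.64, "lang_3": 0.30},
    "model_d": {"lang_1": 0.78, "lang_2": 0.60, "lang_3": 0.26},
}

chinese_mean = group_mean_by_language(chinese_group)
western_mean = group_mean_by_language(western_group)

# Align the two groups on the same language order before correlating.
languages = sorted(chinese_mean)
r = pearson(
    [chinese_mean[lang] for lang in languages],
    [western_mean[lang] for lang in languages],
)
print(f"Pearson r across {len(languages)} languages: {r:.3f}")
```

A high value of `r` here only says that the two groups rank languages similarly. It does not show that the groups reach the same absolute scores, which is why the captions report the Mandarin gap separately.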