Do Chinese models speak Chinese languages?
Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno
TL;DR
This study asks whether Chinese open-source LLMs meaningfully support languages spoken in China beyond Mandarin. It compares six Chinese and four Western models across 21 language variants using Information Parity (IP), Belebele machine reading comprehension (MRC), and MC^2 language identification to test four hypotheses (Null, Mandarin, Pluralist, Regional). Per-language performance correlates strongly between Chinese and Western models, and Chinese models are notably advantaged only in Mandarin. This supports the Mandarin Hypothesis but does not clearly support the Pluralist or Regional hypotheses. The findings suggest that current open-source Chinese models are largely Mandarin-centric and that their compute budgets appear aligned with Western multilingual data distributions, highlighting a need for targeted minority-language data and evaluation, as well as policy-aware model development. Overall, the work informs end users about the language strengths and limitations of open-source Chinese LLMs and motivates focused expansion of minority-language resources and benchmarks to reduce linguistic inequities in AI access.
Abstract
The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they speak the same languages as Western models? Comparing multilingual capabilities is important for two reasons. First, language ability provides insights into pre-training data curation, and thus into resource allocation and development priorities. Second, China has a long history of explicit language policy, varying between inclusivity of minority languages and a Mandarin-first policy. To test whether Chinese LLMs today reflect an agenda about China's languages, we evaluate Chinese and Western open-source LLMs on Asian regional and Chinese minority languages. Our experiments on Information Parity and reading comprehension show that Chinese models' performance across these languages correlates strongly (r=0.93) with that of Western models, the sole exception being stronger Mandarin performance. In some cases, Chinese models cannot identify languages spoken by Chinese minorities, such as Kazakh and Uyghur, even though they perform well in French and German. These results provide a window into current development priorities, suggest options for future development, and offer guidance for end users.
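The cross-model comparison summarized above rests on a correlation between per-language scores. The following is a minimal sketch of how such a Pearson correlation could be computed, not the authors' pipeline. The function name `pearson_r`, the dictionary layout, the language keys, and all score values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical placeholder scores (NOT data from the paper): one average
# score per language for each model group, keyed by a language label.
chinese_scores = {"language_1": 0.90, "language_2": 0.55, "language_3": 0.30}
western_scores = {"language_1": 0.80, "language_2": 0.50, "language_3": 0.35}

# Correlate only over languages that both groups were evaluated on.
shared = sorted(set(chinese_scores) & set(western_scores))
r = pearson_r(
    [chinese_scores[lang] for lang in shared],
    [western_scores[lang] for lang in shared],
)
print(f"Pearson r over {len(shared)} shared languages: {r:.3f}")
```

A value of r near 1 would indicate that the two model groups rank languages similarly. A single language that departs from the trend, as reported for Mandarin, would appear as a point away from the fitted line rather than as a low overall correlation.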
