The Impact of Language Mixing on Bilingual LLM Reasoning

Yihao Li, Jiayi Xin, Miranda Muqing Miao, Qi Long, Lyle Ungar

TL;DR

Language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on MATH500. A lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning; when used to guide decoding, it increases accuracy by 2.92 percentage points.

Abstract

Proficient multilingual speakers often intentionally switch languages in the middle of a conversation. Similarly, recent reasoning-focused bilingual large language models (LLMs) with strong capabilities in both languages exhibit language mixing, that is, alternating languages within their chain of thought. Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning. In this work, we study language switching in Chinese-English bilingual reasoning models. We identify reinforcement learning with verifiable rewards (RLVR) as the critical training stage that leads to language mixing. We show that language mixing can enhance reasoning: enforcing monolingual decoding reduces accuracy by 5.6 percentage points on MATH500. Additionally, a lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning; when used to guide decoding, it increases accuracy by 2.92 percentage points. Our findings suggest that language mixing is not merely a byproduct of multilingual training but a strategic reasoning behavior.

Paper Structure

This paper contains 42 sections, 8 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: An illustration of bilingual code‑switching improving reasoning performance. Two monolingual speakers, one in Chinese and the other in English, fail to solve a math problem, while an LLM robot that code‑switches between both succeeds. Black text denotes language‑agnostic content.
  • Figure 2: Overview of our analysis of language mixing in LLM reasoning. (a) We identify common language mixing patterns and triggers that lead to increased language mixing (Section \ref{sec:patterns}). (b) We compare unconstrained bilingual outputs with constrained monolingual outputs to evaluate the impact of language mixing on reasoning performance (Section \ref{sec:sec3}). (c) We train a probe to classify code-switches as Beneficial, Neutral, or Harmful, and use it to guide decoding (Section \ref{sec:probe}).
  • Figure 3: Four patterns of code-switching observed in LLM outputs. Top left: Phrase-level switching, often short and used for precision or efficiency. Top right: Switching to English for technical terms. Bottom left: Switching to match reasoning or answer formats. Bottom right: Full switch to another language when the model is unable to find a solution.
  • Figure 4: Quantitative analysis of language-mixing behavior in MATH500 responses. (a) Correlation between problem difficulty level and response token count for Chinese prompts. (b) Normalized switch count and non-prompt language fraction as functions of token count, showing that both code-switching frequency and non-prompt language use increase as chain-of-thought reasoning lengthens.
  • Figure 5: Token-level constrained decoding: We mask out tokens from the undesired language, forcing generation in the target language.
  • ...and 4 more figures
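
The token-level constrained decoding described in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the toy vocabulary, the logit values, the Unicode-range heuristic for detecting Chinese, and the helper names (`token_is_chinese`, `mask_logits`, `greedy_pick`) are all assumptions made for illustration. The idea shown is only that tokens from the undesired language receive a logit of negative infinity before the next token is chosen, while language-agnostic tokens such as digits and symbols remain available.

```python
import math

def is_cjk(ch: str) -> bool:
    # Assumption: the CJK Unified Ideographs block is a rough proxy for Chinese.
    return "\u4e00" <= ch <= "\u9fff"

def token_is_chinese(token: str) -> bool:
    return any(is_cjk(ch) for ch in token)

def token_is_english(token: str) -> bool:
    # Assumption: ASCII letters with no Chinese characters count as English.
    has_ascii_letter = any(ch.isascii() and ch.isalpha() for ch in token)
    return has_ascii_letter and not token_is_chinese(token)

def mask_logits(logits, vocab, target="en"):
    """Return a copy of logits with undesired-language tokens set to -inf.

    target="en" bans Chinese tokens; target="zh" bans English tokens.
    Digits, operators, and other language-agnostic tokens are never banned.
    """
    masked = []
    for logit, token in zip(logits, vocab):
        banned = token_is_chinese(token) if target == "en" else token_is_english(token)
        masked.append(-math.inf if banned else logit)
    return masked

def greedy_pick(logits, vocab):
    """Pick the highest-scoring token (greedy decoding for one step)."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]

# Hypothetical vocabulary and next-token logits for a single decoding step.
vocab = ["the", "因此", "=", "2", "sum", "求和"]
logits = [1.0, 3.0, 0.5, 0.2, 2.0, 2.5]

unconstrained = greedy_pick(logits, vocab)
english_only = greedy_pick(mask_logits(logits, vocab, "en"), vocab)
chinese_only = greedy_pick(mask_logits(logits, vocab, "zh"), vocab)
```

In a real decoder the same mask would be applied to the model's logits at every generation step; the per-token language classification would typically be precomputed once over the tokenizer vocabulary rather than recomputed per step.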