Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?
Zhi Rui Tam, Cheng-Kuang Wu, Yu Ying Chiu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee
TL;DR
The paper investigates how large reasoning models (LRMs) solve problems in multilingual settings by examining the language in which they reason, revealing a hub-language phenomenon in which English or Chinese dominates the model's thinking regardless of the input language. It introduces a text prefilling method to steer the model's thinking language and a segmentation-classification framework to analyze reasoning patterns, and applies both to reasoning tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic). Reasoning in a hub language improves accuracy on reasoning tasks. Reasoning in the input language can reduce performance, especially for low-resource languages, but can benefit cultural tasks and regional safety considerations in some contexts; safety outcomes exhibit language-specific biases. The segmentation-classification approach uncovers language-driven reasoning signatures (e.g., Chinese prefills promote subgoal setting, while English prefills promote backward chaining), suggesting that language primes activate culturally embedded problem-solving schemas. Overall, the work highlights biases in multilingual LRMs and offers a practical, scalable method to guide the reasoning language for more equitable deployment across languages and tasks.
Abstract
Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: in which language do these models reason when solving problems presented in different languages? Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.
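To make the idea of constraining the reasoning language concrete, the text prefilling mentioned in the TL;DR can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the opening phrases in `PREFILL_PHRASES`, the `<think>` tag, the simplified `User:`/`Assistant:` layout, and the function name `build_prefilled_prompt` are all assumptions made for the example. The point is only that the prompt ends with the first words of a reasoning trace in the target language, so that the model continues thinking in that language.

```python
# Hypothetical opening phrases used to seed the reasoning trace.
# The exact strings used in the paper are not given here; these are assumptions.
PREFILL_PHRASES = {
    "en": "Okay, let me think step by step.",
    "zh": "好的，让我一步一步地思考。",
    "es": "De acuerdo, pensemos paso a paso.",
}


def build_prefilled_prompt(question: str, thinking_lang: str,
                           think_open: str = "<think>") -> str:
    """Return a prompt whose final text is a partial reasoning trace.

    The prompt is left open-ended after the prefill phrase so that a
    completion-style model continues the reasoning in `thinking_lang`.
    The "User:/Assistant:" layout is a simplified stand-in for a real
    chat template (an assumption of this sketch).
    """
    if thinking_lang not in PREFILL_PHRASES:
        raise ValueError(f"no prefill phrase for language: {thinking_lang!r}")
    prefill = PREFILL_PHRASES[thinking_lang]
    return f"User: {question}\nAssistant: {think_open}\n{prefill}"


if __name__ == "__main__":
    # A Spanish question whose reasoning is steered toward English.
    print(build_prefilled_prompt("¿Cuánto es 17 por 23?", "en"))
```

With a real model, the returned string would be passed to a text-completion endpoint, or the prefill would be placed as the start of the assistant turn in the model's own chat template. Comparing runs with the prefill set to the input language against runs with it set to a hub language is the kind of contrast the evaluations above describe.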
