Table of Contents
Fetching ...

Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

Zhi Rui Tam, Cheng-Kuang Wu, Yu Ying Chiu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee

TL;DR

The paper investigates how large reasoning models (LRMs) solve problems in multilingual settings by examining the internal language they use for reasoning, revealing a hub-language phenomenon where English or Chinese dominates thinking regardless of input. It introduces a text prefilling method to steer the model's thinking language and a segmentation-classification framework to analyze reasoning patterns, applying them to reasoning tasks (MMMLU, MATH-500) as well as non-reasoning benchmarks (CulturalBench, LMSYS-Toxic). Key findings show that hub-language reasoning improves accuracy on reasoning tasks, while native-language reasoning can reduce performance for low-resource languages but benefits cultural and regional safety considerations in some contexts; safety outcomes exhibit language-specific biases. The segmentation-classification approach uncovers language-driven reasoning signatures (e.g., Chinese-prefill promotes subgoal setting; English-prefill promotes backward chaining), suggesting that language primes activate culturally embedded problem-solving schemas. Overall, the work highlights biases in multilingual LRMs and offers a practical, scalable method to guide reasoning language for more equitable deployment across languages and tasks.

Abstract

Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: {\it In which language do these models reason when solving problems presented in different languages?} Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.

Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

TL;DR

The paper investigates how large reasoning models (LRMs) solve problems in multilingual settings by examining the internal language they use for reasoning, revealing a hub-language phenomenon where English or Chinese dominates thinking regardless of input. It introduces a text prefilling method to steer the model's thinking language and a segmentation-classification framework to analyze reasoning patterns, applying them to reasoning tasks (MMMLU, MATH-500) as well as non-reasoning benchmarks (CulturalBench, LMSYS-Toxic). Key findings show that hub-language reasoning improves accuracy on reasoning tasks, while native-language reasoning can reduce performance for low-resource languages but benefits cultural and regional safety considerations in some contexts; safety outcomes exhibit language-specific biases. The segmentation-classification approach uncovers language-driven reasoning signatures (e.g., Chinese-prefill promotes subgoal setting; English-prefill promotes backward chaining), suggesting that language primes activate culturally embedded problem-solving schemas. Overall, the work highlights biases in multilingual LRMs and offers a practical, scalable method to guide reasoning language for more equitable deployment across languages and tasks.

Abstract

Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: {\it In which language do these models reason when solving problems presented in different languages?} Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of language choice varies by task type: input-language reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior. By exposing these linguistic biases in LRMs, our work highlights a critical step toward developing more equitable models that serve users across diverse linguistic backgrounds.

Paper Structure

This paper contains 41 sections, 12 figures, 18 tables.

Figures (12)

  • Figure 1: We control LRMs' thinking language by prefilling a language-specific prefill tokens (e.g., "Okay" for English in blue cell) after the <think> token. In reasoning tasks, thinking in "reasoning hub" language (e.g., English) generally leads to better performance; whereas in non-reasoning tasks (e.g., toxicity detection), thinking in non "reasoning hub" language (e.g., Japanese) enables LRMs to notice the safety problem and reject the user's toxic request.
  • Figure 2: Language distribution visualization. Top: Distribution in the reason section showing language detection patterns across different models. Middle: Distribution in the answer section reveals how language preferences shift between reasoning and final outputs. Bottom: Distribution in the reason section after applying phrase prefilling, all reasoning languages were able to align well with the input language.
  • Figure 3: Model performance comparison across global regions when using English versus native language prompts
  • Figure 4: Two-stage pipeline for step-level category annotation of reasoning chains.
  • Figure 5: Correlation Matrix Between Prefill Target Languages and Reasoning Types
  • ...and 7 more figures