Multilingual Test-Time Scaling via Initial Thought Transfer

Prasoon Bajpai, Tanmoy Chakraborty

TL;DR

This work addresses the gap in multilingual reasoning by systematically evaluating test-time scaling across high- and low-resource Latin-script languages using two DeepSeek-R1 models on the multilingual AIME2025 benchmark. It reveals substantial language-dependent variability in scaling, with prevalent English leakage during long reasoning and divergent initial thought patterns for low-resource languages. The authors introduce MITT (Multilingual Initial Thought Transfer), a lightweight, unsupervised prefix-tuning approach that transfers high-resource reasoning prefixes to improve cross-language reasoning without cross-lingual supervision. MITT demonstrates improved reasoning performance and more stable scaling for DeepSeek-R1-Distill-Qwen-7B, especially in underrepresented languages, highlighting a practical path to strengthen multilingual reasoning fidelity at inference time. The findings underscore the importance of reasoning-grounded multilingual generalization and provide diagnostic tools and an effective intervention to reduce cross-language disparities.

Abstract

Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning. Building on our findings, we introduce MITT (Multilingual Initial Thought Transfer), an unsupervised and lightweight reasoning prefix-tuning approach that transfers high-resource reasoning prefixes to enhance test-time scaling across all languages, addressing inconsistencies in multilingual reasoning performance. MITT significantly boosts DeepSeek-R1-Distill-Qwen-7B's reasoning performance, especially for underrepresented languages.

Paper Structure

This paper contains 22 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Effect of Multilingual Initial Thought Transfer (MITT) on DeepSeek-R1-Distill-Qwen-7B. We observe that fine-tuning on initial reasoning steps in English (<32 tokens) offers an unsupervised and data-efficient way not only to improve accuracy but also to enhance the progressive gains from scaling test-time compute in both low- and high-resource language settings.
  • Figure 2: Test-time scaling trends for (a, b) DeepSeek-R1-Distill-Llama-8B and (c, d) DeepSeek-R1-Distill-Qwen-7B. Panels (a) and (c) display overall trends in test-time scaling across all languages, while (b) and (d) present average gains separately for the low-resource and high-resource language groups. The results reveal a consistent pattern: both models demonstrate stronger test-time scaling in high-resource languages than in low-resource ones. DeepSeek-R1-Distill-Qwen-7B shows insignificant test-time scaling for low-resource languages.
  • Figure 3: Visualization of the sentence-level language trajectory across the generated reasoning stream. Each row represents the dominant language detected at each generation segment, with panels (a) and (b) corresponding to DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B, respectively. Although the models are prompted monolingually in the target language, we observe the intrusion of English into the generation process.
  • Figure 4: Comparison of the similarity of initial reasoning segments (up to 1000 tokens) between English and other target languages. To enhance interpretability, we apply a rolling average with a window size of 5 in (b).
  • Figure 5: Distribution of similarity scores across all questions, for initial reasoning segments (first 32 tokens) sampled from 100 generations per question across six languages. All reasoning chains are first translated to English using Gemini and then embedded using an English monolingual encoder to compute pairwise intra-language similarity. (a) Results for DeepSeek-R1-Distill-Llama-8B, and (b) results for DeepSeek-R1-Distill-Qwen-7B.
  • ...and 4 more figures
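
The captions for Figures 4 and 5 mention two small computations: a rolling average with a window of 5, and a pairwise intra-language similarity over embedded reasoning prefixes. The sketch below illustrates both under stated assumptions. The captions do not name the similarity metric, so cosine similarity is assumed here; the function names (`cosine`, `mean_pairwise_similarity`, `rolling_average`) are hypothetical, the rolling average drops incomplete windows, and the translation and embedding steps are left out, with embeddings passed in as plain lists of floats.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Return 0.0 for a zero vector rather than dividing by zero.
    return dot / norm if norm else 0.0


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings.

    Intended use: `embeddings` holds one vector per generation of the same
    question in the same language, so a higher value means the early
    reasoning is more consistent across generations.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def rolling_average(values, window=5):
    """Rolling mean that keeps only full windows (an assumption)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

For example, `mean_pairwise_similarity` could be called once per question and language on the embedded 32-token prefixes to build a distribution of scores, and `rolling_average(scores, window=5)` could smooth a similarity curve before plotting.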