Long Chain-of-Thought Reasoning Across Languages

Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

TL;DR

This paper investigates whether long chain-of-thought (CoT) reasoning transfers from English to nine non-English languages across four stages of model development: scaling, pretraining, post-training, and inference. It introduces the En-CoT and Target-CoT settings to dissociate input understanding from reasoning. Scaling improves input comprehension, but Target-CoT remains far behind En-CoT on long, multi-step tasks. Broad multilingual pretraining closes comprehension gaps and boosts both modes, while specialized reasoning pretraining helps English reasoning but can degrade target-language reasoning. In post-training with synthetic data, translation-based traces often outperform distilled traces, and small target-language datasets yield gains comparable to much larger English datasets, which makes improvements practical for mid- and low-resource languages.

Abstract

While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
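
The reasoning settings described above differ only in which language is used for the input and which for the chain-of-thought. The following Python sketch makes that explicit. It is illustrative and not taken from the paper's released code: the names `ReasoningSetting`, `make_settings`, and `decompose_gap` are hypothetical. The En-Only baseline (input and reasoning both in English) comes from the Figure 1 caption. Reading the En-Only minus En-CoT difference as a comprehension gap, and the En-CoT minus Target-CoT difference as a reasoning gap, is an assumption about how the settings dissociate the two.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReasoningSetting:
    """One evaluation setting: which language the input and the CoT use."""
    name: str
    input_language: str      # language of the task input
    reasoning_language: str  # language of the long chain-of-thought


def make_settings(target_language: str) -> Dict[str, ReasoningSetting]:
    """Build the three settings for a single non-English target language."""
    return {
        "En-Only": ReasoningSetting("En-Only", "English", "English"),
        "En-CoT": ReasoningSetting("En-CoT", target_language, "English"),
        "Target-CoT": ReasoningSetting("Target-CoT", target_language, target_language),
    }


def decompose_gap(en_only: float, en_cot: float, target_cot: float) -> Dict[str, float]:
    """Split the En-Only vs. Target-CoT accuracy difference into two parts.

    Assumption (not a formula from the paper): the settings differ in one
    factor at a time, so each pairwise difference isolates one factor.
    """
    return {
        # Only the input language changes between En-Only and En-CoT.
        "comprehension_gap": en_only - en_cot,
        # Only the reasoning language changes between En-CoT and Target-CoT.
        "reasoning_gap": en_cot - target_cot,
    }


if __name__ == "__main__":
    for setting in make_settings("Swahili").values():
        print(f"{setting.name}: input={setting.input_language}, "
              f"reasoning={setting.reasoning_language}")
```

Under this reading, the two gaps sum to the total difference between En-Only and Target-CoT accuracy, which is one way to see whether input comprehension or target-language reasoning accounts for most of the loss.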

Paper Structure

This paper contains 27 sections, 8 figures, and 11 tables.

Figures (8)

  • Figure 1: Example inputs, long chains-of-thought, and answers for En-Only, En-CoT, and Target-CoT settings, drawn from DeepSeek-R1-Distill-Llama-70B on AIME 2025. Orange boxes denote English text and blue boxes denote Swahili text. En-Only (input & reasoning in English) and En-CoT (input in Swahili, reasoning in English) lead to correct answers, while Target-CoT (input & reasoning in Swahili) contains a reasoning error (highlighted in red) and leads to an incorrect answer.
  • Figure 2: Evaluation of scaling trends in DeepSeek-R1-Distill models on AIME-Combined. For high- and mid-resource languages, En-CoT performance increases and approaches En-Only performance with scale, while Target-CoT performance is consistently lower than En-CoT, highlighting target-language reasoning as the bottleneck.
  • Figure 3: Inference efficiency of fine-tuned models on AIME-Combined in terms of tokens. Accuracy is negatively correlated with cost. Fine-tuning on translated data (right) narrows the efficiency gap across languages.
  • Figure 4: Distribution of error types found in incorrect responses from DeepSeek-R1-Distill-Llama-70B on AIME-Combined. The majority of errors in En-CoT stem from reasoning mistakes, while Target-CoT exhibits a higher proportion of output generation errors and conceptual errors compared to En-CoT.
  • Figure 5: Evaluation of scaling trends in DeepSeek-R1-Distill models on MMLU-ProX. DeepSeek-R1-Distill models demonstrate narrower gaps between Target-CoT and En-CoT on short CoT tasks, with target-language reasoning in Chinese and French even outperforming English at multiple scales.
  • ...and 3 more figures