Long Chain-of-Thought Reasoning Across Languages

Josh Barua; Seun Eisape; Kayo Yin; Alane Suhr

TL;DR

This paper investigates whether long chain-of-thought (CoT) reasoning transfers from English to nine non-English languages across four development stages: scaling, pretraining, post-training, and inference. It introduces En-CoT and Target-CoT settings to dissociate input understanding from reasoning and reports that scaling improves input comprehension but Target-CoT remains far behind En-CoT for long multi-step tasks. Broad multilingual pretraining closes comprehension gaps and boosts both modes, while specialized reasoning pretraining helps English reasoning but can degrade target-language reasoning. In post-training with synthetic data, translation-based traces often outperform distilled traces, and small target-language datasets yield gains comparable to those from much larger English datasets, which makes such gains practical for mid- and low-resource languages.

Abstract

While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
