Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Xue Zhang; Yunlong Liang; Fandong Meng; Songming Zhang; Kaiyu Huang; Yufeng Chen; Jinan Xu; Jie Zhou

TL;DR

The paper addresses two weaknesses of large reasoning models on non-English inputs: inconsistency between input and output language, and weaker reasoning than in English. It introduces M-Thinker, a model trained with GRPO using a Language Consistency (LC) reward to enforce language fidelity and a Cross-lingual Thinking Alignment (CTA) reward to transfer English reasoning to other languages. Experiments on MMATH and PolyMath show nearly 100% language consistency and strong multilingual accuracy, with notable generalization to out-of-domain languages. The authors present a scalable training pipeline, comprising cold-start SFT, rejection sampling, and iterative RL, and argue that language-consistent multilingual reasoning can approach English-level performance across multiple languages.

Abstract

Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) they often struggle to maintain input-output language consistency; (2) they generally produce flawed reasoning paths and lower answer accuracy than in English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained with the GRPO algorithm using a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward imposes a strict constraint on language consistency between the input, thought, and answer. In addition, the CTA reward compares the model's non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.
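
To make the reward design concrete, the following is a minimal Python sketch of how a strict language-consistency gate and an alignment bonus might be combined before GRPO-style group normalization. It is not the authors' implementation. The names `dominant_script`, `is_consistent`, `total_reward`, and `group_advantages`, the reward values (-1.0 for inconsistency, 1.0 for a correct answer), the `cta_weight` parameter, and the idea of passing the CTA score in as a number in [0, 1] are all assumptions made for illustration. The script-based detector is a toy stand-in for a real language identifier and cannot separate languages that share a script (for example French and English).

```python
import statistics
import unicodedata


def dominant_script(text: str) -> str:
    """Toy language proxy (assumption): most frequent Unicode script among letters."""
    counts: dict[str, int] = {}
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            script = name.split(" ")[0] if name else "UNKNOWN"
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "NONE"


def is_consistent(question: str, thought: str, answer: str) -> bool:
    """Strict check: input, thought, and answer must all share one script."""
    target = dominant_script(question)
    return dominant_script(thought) == target and dominant_script(answer) == target


def total_reward(consistent: bool, correct: bool, cta_score: float,
                 cta_weight: float = 1.0) -> float:
    """Hypothetical combination: inconsistency overrides everything else."""
    if not consistent:
        return -1.0  # assumed penalty value, not taken from the paper
    accuracy = 1.0 if correct else 0.0
    # cta_score is assumed to be a similarity in [0, 1] between the
    # non-English reasoning path and the model's own English reasoning path.
    return accuracy + cta_weight * cta_score


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    q = "Сколько будет 2+2?"
    samples = [
        ("Сложим два и два.", "Ответ: 4", True, 0.8),   # consistent, correct
        ("Add two and two.", "Answer: 4", True, 0.9),   # thinks in English
    ]
    rewards = [
        total_reward(is_consistent(q, t, a), ok, cta) for t, a, ok, cta in samples
    ]
    print(rewards, group_advantages(rewards))
```

Under these assumptions, a response that reasons in English for a Russian question receives the penalty even when its answer is correct, so within a sampled group it gets a lower advantage than a consistent response.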

Figures (4)