TAPO: Translation Augmented Policy Optimization for Multilingual Mathematical Reasoning

Xu Huang, Zhejian Lai, Zixian Huang, Jiajun Chen, Shujian Huang

Abstract

Large Language Models (LLMs) have demonstrated remarkable proficiency in English mathematical reasoning, yet a significant performance disparity persists in multilingual contexts, largely attributed to deficiencies in language understanding. To bridge this gap, we introduce Translation-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework built upon GRPO. TAPO enforces an explicit alignment strategy where the model leverages English as a pivot and follows an understand-then-reason paradigm. Crucially, we employ a step-level relative advantage mechanism that decouples understanding from reasoning, allowing the integration of translation quality rewards without introducing optimization conflicts. Extensive experiments reveal that TAPO effectively synergizes language understanding with reasoning capabilities and is compatible with various models. It outperforms baseline methods in both multilingual mathematical reasoning and translation tasks, while generalizing well to unseen languages and out-of-domain tasks.
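To make the "step-level relative advantage" idea concrete, here is a minimal sketch of how a GRPO-style group-normalized advantage could be computed separately for the understanding (translation) step and the reasoning step, so that a translation-quality reward does not interfere with the reasoning reward. This is an illustrative assumption based on the abstract, not the paper's actual implementation; all function names and the exact normalization are hypothetical.

```python
import numpy as np

def group_relative_advantage(rewards, eps=1e-6):
    # GRPO-style normalization: each sampled response's reward is
    # compared against the mean/std of its sampling group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def step_level_advantages(translation_rewards, reasoning_rewards):
    # Hypothetical decoupling: advantages are normalized per step,
    # then each token span of a response would receive the advantage
    # of the step it belongs to (translation vs. reasoning).
    a_translation = group_relative_advantage(translation_rewards)
    a_reasoning = group_relative_advantage(reasoning_rewards)
    return a_translation, a_reasoning

# Example group of 4 rollouts: ChrF++-like translation rewards and
# binary answer-correctness rewards (illustrative values only).
a_t, a_r = step_level_advantages([0.8, 0.5, 0.9, 0.6], [1, 0, 1, 1])
```

Because each step's rewards are normalized within their own group, a rollout with a good translation but a wrong answer gets a positive translation-step advantage and a negative reasoning-step advantage, rather than a single conflicting signal.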

Paper Structure

This paper contains 34 sections, 6 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Problem-level relationship between translation quality and reasoning accuracy on MGSM. Each point represents a distinct non-English problem for which the accuracy of the corresponding English problem is over 0.9. The results are generated by Qwen2.5-3B-Instruct with 100 runs. The x-axis indicates the mean ChrF++ score of the translation, while the y-axis shows the overall accuracy. The dashed line represents the line of best fit, with Pearson correlation coefficient $r = 0.583$, $p < 10^{-5}$.
  • Figure 2: The performance of Qwen2.5-3B-Instruct on MGSM with different prompts. The full prompts are in Appendix \ref{sec:prompts}. The last two are the same as Translate-Test, but the model's translation is replaced with Gemini2.5-Flash's and the reference, respectively.
  • Figure 3: The illustration of the step-level relative advantage calculation in TAPO.
  • Figure 4: The proportions of false positive and false negative advantages across each training step. The overall proportion is the sum of the two proportions.
  • Figure 5: The average number of response tokens on MGSM across all languages. Each bar represents the total number of tokens in a response, and the hatched area represents the number of translation tokens.
  • ...and 2 more figures