MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning

Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu

TL;DR

This work addresses the challenge of applying pure RL to machine translation by introducing MT-R1-Zero, an open-source adaptation of the R1-Zero framework that uses a rule-metric mixed reward to guide reinforcement learning without supervised fine-tuning. By combining a format-checking component with lexical and semantic quality metrics within a GRPO training loop, MT-R1-Zero achieves results on English-Chinese translation that range from competitive to state-of-the-art, and it generalizes well to out-of-distribution and multilingual settings. Key findings are that the choice of reward metric critically shapes the optimization target, that translation-specific thinking patterns emerge during training, and that RL itself, rather than the verbosity of explicit thinking, is the principal driver of quality gains. The work provides practical recipes for reward design, an analysis of how well different model architectures adapt, and open-source resources for extending RL-based MT to diverse languages and domains.

Abstract

Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.
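
As a rough illustration of the rule-metric mixed reward described above, the following minimal Python sketch gates a metric score behind a format check and then normalizes rewards within a sampled group, as GRPO-style training does. It is not the authors' implementation: the `<think>`/`<translate>` tag layout, the penalty value, the unweighted sum of the two scores, the character-level F1 used in place of BLEU, and the caller-supplied `semantic_score` function (standing in for a learned metric such as COMETKiwi) are all assumptions made for this example.

```python
import re
import statistics
from collections import Counter
from typing import Callable, List

# Hypothetical output layout: the source only states that a format check exists.
_FORMAT = re.compile(
    r"^<think>(.+?)</think>\s*<translate>(.+?)</translate>$", re.DOTALL
)


def toy_lexical_score(hypothesis: str, reference: str) -> float:
    """Character-level unigram F1 in [0, 1]; a simple stand-in for BLEU."""
    hyp = Counter(c for c in hypothesis if not c.isspace())
    ref = Counter(c for c in reference if not c.isspace())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mixed_reward(
    response: str,
    reference: str,
    semantic_score: Callable[[str], float],
    format_penalty: float = -1.0,  # assumed value, not taken from the paper
) -> float:
    """Rule part: reject malformed outputs. Metric part: lexical + semantic."""
    match = _FORMAT.match(response.strip())
    if match is None:
        return format_penalty
    translation = match.group(2).strip()
    # Assumed combination: a plain unweighted sum of the two scores.
    return toy_lexical_score(translation, reference) + semantic_score(translation)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In a training loop, `mixed_reward` would be called once per sampled response for a given source sentence, and `group_relative_advantages` would turn that group's rewards into the per-response advantages used by the policy update. Dropping either term of the sum gives a lexical-only or semantic-only reward in the spirit of the Lex and Sem variants named in the abstract.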

Paper Structure

This paper contains 22 sections, 4 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Performance comparison of contemporary LLM-based translation systems on the WMT 24 EN-ZH test set, plotted by average score across BLEU, COMETKiwi, and XCOMET versus model size.
  • Figure 2: Training dynamics using Reward-Lex, Reward-Sem, and Reward-Mix, evaluated with COMETKiwi, BLEU, and XCOMET.
  • Figure 3: Qualitative examples illustrating the effect of different reward functions (Reward-Lex, Reward-Sem, Reward-Mix) on EN-ZH translation, where the stylistic differences are driven by reward optimization (Finding 1).
  • Figure 4: Training dynamics of MT-R1-Zero models (using Reward-Sem). Left: COMETKiwi score progression for 3B and 7B models on EN-ZH and ZH-EN test sets. Right: Average response length changes over training steps, exhibiting the classic decrease-then-increase pattern (Finding 2).
  • Figure 5: Evolution of an MT-R1-Zero model's reasoning process and translation output for the Chinese source text "其影响可能类似于2008年的经济危机" at different training steps (0, 400, 1600), showcasing the shift from decomposition to more semantic analysis (Finding 2).
  • ...and 7 more figures