R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning

Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie

TL;DR

This work addresses the lack of inference-time reasoning in machine translation (MT) by introducing R1-T1, a framework that incentivizes reasoning through reinforcement learning (RL) guided by human-aligned chain-of-thought (CoT) templates. It extends reasoning-based MT from niche tasks to general multilingual and domain translation by formalizing six CoT templates that reflect human translator strategies, and it enables self-evolving CoTs via RL. Training proceeds in two stages: supervised fine-tuning (SFT) on a reasoning-enhanced seed dataset, then RL-based exploration with the GRPO algorithm, using a reward that balances output formatting and translation quality. The result is improved translation across 10+ languages and unseen directions, with reported gains such as a $9.6\%$ average improvement over plain SFT. Human evaluation confirms gains in accuracy and fluency, and the CoT self-evolution analysis shows adaptive, context-aware translations, supporting broader applicability in real-world MT. The authors also open-source datasets and code to spur further research.
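The RL stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `<think>`/`<translate>` tag layout, the `format_weight` value, and the `toy_quality` metric (a unigram F1 standing in for a real MT metric) are all assumptions made for illustration, since this summary does not specify the actual reward. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import re
from statistics import mean, pstdev

# Hypothetical output layout: reasoning in <think>, final answer in <translate>.
_FORMAT = re.compile(
    r"<think>(.+?)</think>\s*<translate>(.+?)</translate>", re.DOTALL
)


def extract_translation(output: str):
    """Return the translation if the output follows the assumed layout, else None."""
    match = _FORMAT.fullmatch(output.strip())
    return match.group(2).strip() if match else None


def toy_quality(hypothesis: str, reference: str) -> float:
    """Stand-in quality score in [0, 1]: unigram F1 over whitespace tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum(min(hyp.count(tok), ref.count(tok)) for tok in set(hyp))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reward(output: str, reference: str, format_weight: float = 0.2) -> float:
    """Combine a format bonus with a quality score (the weighting is an assumption)."""
    translation = extract_translation(output)
    if translation is None:
        return 0.0  # malformed outputs earn nothing
    return format_weight + (1.0 - format_weight) * toy_quality(translation, reference)


def group_relative_advantages(rewards, eps: float = 1e-8):
    """GRPO-style advantages: normalize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "the cat sits on the mat"
    group = [
        "<think>literal reading works</think><translate>the cat sits on the mat</translate>",
        "<think>paraphrase</think><translate>a cat is on the mat</translate>",
        "the cat sits on the mat",  # correct text, but the format is violated
    ]
    rewards = [reward(o, reference) for o in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, outputs sampled for the same source sentence compete only against each other, so a well-formatted, higher-quality translation receives a positive advantage even when all absolute rewards are low.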

Abstract

Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered chains of thought (CoTs), remains underexplored. Existing methods either design a fixed CoT tailored to a specific MT sub-task (e.g., literature translation), or rely on synthesizing CoTs that are not aligned with human reasoning and on supervised fine-tuning (SFT) that is prone to overfitting, which limits their adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework that achieves inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation to broader MT scenarios (e.g., multilingual MT, domain MT) unseen in the training phase; (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies like context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery through RL. Both human and automatic evaluation results indicate a steady improvement in translation performance across 10+ languages and 40+ translation directions on the Flores-101 test set and four domain-specific MT tasks, especially for languages unseen in training.

Paper Structure

This paper contains 38 sections, 3 equations, 2 figures, 9 tables.

Figures (2)

  • Figure 1: Illustration of the self-evolving process of translation reasoning with RL incentivization.
  • Figure 2: Illustration of the construction of the MT reasoning dataset and the training of R1-T1.