R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
Minggui He, Yilun Liu, Shimin Tao, Yuanchang Luo, Hongyong Zeng, Chang Su, Li Zhang, Hongxia Ma, Daimeng Wei, Weibin Meng, Hao Yang, Boxing Chen, Osamu Yoshie
TL;DR
This work addresses the lack of inference-time reasoning in machine translation (MT) by introducing R1-T1, a framework that incentivizes reasoning through reinforcement learning (RL) guided by human-aligned chain-of-thought (CoT) templates. It extends reasoning-based MT from niche tasks to general multilingual and domain translation by formalizing six CoT templates that reflect human translator strategies, and it enables self-evolving CoTs via RL. Training proceeds in two stages: supervised fine-tuning (SFT) on a reasoning-enhanced seed dataset, followed by RL-based exploration using the GRPO algorithm with a reward that balances output formatting and translation quality. The result is improved translation across 10+ languages and unseen directions, with reported gains such as a $9.6\%$ average improvement over plain SFT. Human evaluation confirms gains in accuracy and fluency, and an analysis of CoT self-evolution shows adaptive, context-aware translations, supporting broader applicability in real-world MT. The authors also open-source datasets and code to spur further research.
Abstract
Despite recent breakthroughs in reasoning-enhanced large language models (LLMs) like DeepSeek-R1, incorporating inference-time reasoning into machine translation (MT), where human translators naturally employ structured, multi-layered reasoning chains of thought (CoTs), remains underexplored. Existing methods either design a fixed CoT tailored to a specific MT sub-task (e.g., literature translation), or rely on synthesized CoTs that are not aligned with human reasoning and on supervised fine-tuning (SFT), which is prone to overfitting; both limit adaptability to diverse translation scenarios. This paper introduces R1-Translator (R1-T1), a novel framework that achieves inference-time reasoning for general MT via reinforcement learning (RL) with human-aligned CoTs comprising six common patterns. Our approach pioneers three innovations: (1) extending reasoning-based translation to broader MT scenarios (e.g., multilingual MT, domain MT) unseen in the training phase; (2) formalizing six expert-curated CoT templates that mirror hybrid human strategies such as context-aware paraphrasing and back translation; and (3) enabling self-evolving CoT discovery through RL. Both human and automatic evaluation results indicate a steady improvement in translation performance across a total of 10+ languages and 40+ translation directions on the Flores-101 test set and four domain-specific MT tasks, especially for languages unseen during training.
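
As a rough illustration of the RL stage summarized above, the sketch below shows one way a reward could combine a formatting check with a translation-quality score, and how GRPO-style training normalizes rewards within a group of sampled outputs. It is not the authors' implementation. The `<think>`/`<translate>` tag layout, the weights `w_format` and `w_quality`, and the unigram-F1 stand-in for a quality metric are all assumptions made only for this example.

```python
import re
from collections import Counter
from statistics import mean, pstdev

# Hypothetical output layout (an assumption): reasoning, then the translation.
PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<translate>(.+?)</translate>$", re.DOTALL
)


def extract_translation(output: str):
    """Return the translation if the output follows the assumed layout, else None."""
    match = PATTERN.match(output.strip())
    return match.group(2).strip() if match else None


def quality_reward(hypothesis: str, reference: str) -> float:
    """Placeholder quality score: unigram F1 overlap with the reference.

    A real system would plug in a proper MT metric here.
    """
    hyp, ref = hypothesis.split(), reference.split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def total_reward(output: str, reference: str,
                 w_format: float = 0.2, w_quality: float = 0.8) -> float:
    """Weighted sum of a format term and a quality term (weights are assumed)."""
    translation = extract_translation(output)
    if translation is None:
        return 0.0  # malformed outputs earn nothing
    return w_format * 1.0 + w_quality * quality_reward(translation, reference)


def group_relative_advantages(rewards, eps: float = 1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group of outputs sampled for the same source sentence, one would compute `total_reward` for each and pass the list to `group_relative_advantages`; outputs scoring above the group mean receive positive advantages and are reinforced, while malformed or weaker outputs are pushed down.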
