Unlocking Reasoning Capability on Machine Translation in Large Language Models

Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi

TL;DR

Reasoning-oriented LLMs often rely on chain-of-thought to improve performance, but applying it to machine translation can backfire because translation is a constraint-driven task. The authors systematically evaluate several reasoning LLMs on the WMT24++ MT benchmark and find that enabling explicit reasoning consistently degrades translation quality: the reasoning traces are typically linear, with no revision, self-correction, or exploration of alternative translations. Injecting higher-quality reasoning traces from stronger models also yields no reliable gains. They then design a structured, MT-tailored reasoning framework built on multi-step drafting, adequacy and fluency refinement, and selective iterative revision, and use dynamic reasoning templates to generate 28k synthetic traces. Post-training a large reasoning model on this data yields significant translation-quality improvements over standard translation fine-tuning and injected generic reasoning baselines, indicating that task-specific reasoning structure is crucial for MT.

Abstract

Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.

Paper Structure

This paper contains 29 sections, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Performance of open-weights RLMs when their reasoning traces are generated by other models and injected. The dashed line represents the baseline performance of the receiving models, and the injecting models are shown at the top. The strength of the injecting model used to generate reasoning traces does not correlate with final translation quality across injecting and receiving models.
  • Figure 2: An example of the dynamic template used to generate structured reasoning traces. Curly-braced fields denote placeholders. Steps 2 and 3 are applied only to selected challenging sentences, i.e., those that benefited the most from steps 2 and 3 by the defined MetricX margin; all other segments are discarded from the initial draft.
  • Figure 3: Distribution of language pairs in the 28k structured reasoning traces used as training data. Most language pairs have a very low frequency and are grouped into the Other category for presentation.
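
The margin-based selection mentioned in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` class, the `select_for_refinement` function, the score values, and the margin of 1.0 are all hypothetical, and the sketch assumes a lower-is-better error score, which is how MetricX is usually reported.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    draft_score: float    # hypothetical error score of the initial draft (lower is better)
    refined_score: float  # hypothetical error score after the refinement steps


def select_for_refinement(segments, margin):
    """Split segments into those whose refinement gain meets the margin and the rest."""
    selected, rest = [], []
    for seg in segments:
        # A positive gain means the refined translation has a lower error score.
        gain = seg.draft_score - seg.refined_score
        (selected if gain >= margin else rest).append(seg)
    return selected, rest


# Illustrative values only; none of these numbers come from the paper.
segments = [
    Segment("Break a leg!", draft_score=6.0, refined_score=2.5),
    Segment("The cat sleeps.", draft_score=1.0, refined_score=0.9),
]
selected, rest = select_for_refinement(segments, margin=1.0)
```

Here only the idiomatic sentence clears the margin, so it alone would receive the refinement steps in its reasoning trace. How the remaining segments are handled is left to the caller, since the caption does not spell it out in detail.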