LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens
Armel Zebaze, Rachel Bawden, Benoît Sagot
TL;DR
This work systematically evaluates whether intermediate reasoning tokens (thinking traces) improve machine translation with large language models. It finds that generic thinking prompts do not enhance MT performance and that CoT distillation generally underperforms standard fine-tuning, while MT-specific prompting strategies can yield gains when their traces include translation attempts. The strongest improvements arise from traces that contain drafts and translation attempts, yet the most reliable gains come from improving target translations and expanding parallel data rather than distilling CoT explanations. Overall, parallel data quality and quantity dominate MT performance, with reasoning traces offering limited, context-dependent benefits. The findings favor investing in high-quality translations and data augmentation over enforcing explicit thinking in MT models.
Abstract
Large reasoning models (LRMs) have opened new possibilities for problem-solving by devising a natural language thought process before answering a query. While their capabilities are well established on mathematics and coding tasks, their impact on machine translation (MT) remains underexplored. In this work, we explore the benefits of generating intermediate tokens when performing MT across multiple language pairs with different resource levels and across multiple setups. We find that "thinking tokens" do not help LRMs perform MT better. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step by step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning depends heavily on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling its CoT explanations into "thinking" MT models.
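
To make the three fine-tuning setups contrasted above concrete, the following minimal sketch shows how a single training completion might be formatted under each one: plain input-output, a distilled CoT explanation, and intermediate tokens made of translation attempts. It is an illustration only, not the authors' pipeline. The `<think>` delimiters, the prompt wording, the `Example` fields, and the mode names are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One parallel sentence pair plus optional teacher outputs (hypothetical schema)."""
    source: str
    target: str
    explanation: Optional[str] = None   # teacher-written step-by-step explanation
    drafts: Optional[List[str]] = None  # intermediate translation attempts


def build_prompt(source: str, src_lang: str, tgt_lang: str) -> str:
    # Assumed prompt wording; the abstract does not specify a template.
    return f"Translate from {src_lang} to {tgt_lang}:\n{source}"


def build_completion(ex: Example, mode: str = "io") -> str:
    """Return the text the model is trained to generate for `ex`.

    mode="io":     the target translation only (standard fine-tuning).
    mode="cot":    an explanation wrapped in assumed <think> tags, then the target.
    mode="drafts": translation attempts wrapped in <think> tags, then the target.
    """
    if mode == "io":
        return ex.target
    if mode == "cot":
        if not ex.explanation:
            raise ValueError("mode='cot' requires an explanation")
        return f"<think>\n{ex.explanation}\n</think>\n{ex.target}"
    if mode == "drafts":
        if not ex.drafts:
            raise ValueError("mode='drafts' requires at least one draft")
        body = "\n".join(f"Draft {i}: {d}" for i, d in enumerate(ex.drafts, start=1))
        return f"<think>\n{body}\n</think>\n{ex.target}"
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch the prompt is identical across modes, so any difference in downstream quality would come from what sits between the delimiters: an explanation of how to translate, or actual translation attempts.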
