LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Armel Zebaze, Rachel Bawden, Benoît Sagot

TL;DR

This work systematically evaluates whether intermediate reasoning tokens (thinking traces) improve machine translation with large language models. It finds that generic thinking prompts do not enhance MT performance and that CoT distillation generally underperforms standard fine-tuning, while MT-specific prompting strategies can yield gains when their traces include translation attempts. The strongest improvements arise from traces that reflect drafting and translation efforts, yet the most reliable gains come from improving target translations and expanding parallel data rather than distilling CoT explanations. Overall, parallel data quality and quantity dominate MT performance, with reasoning traces offering limited, context-dependent benefits. The findings advise focusing on high-quality translations and data augmentation over enforcing explicit thinking in MT models.

Abstract

Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.
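
The contrast the abstract draws between standard input-output fine-tuning and CoT fine-tuning can be pictured as a difference in how each training target is assembled. The following is a minimal sketch, not the authors' implementation: the prompt wording, the `<think>` delimiters, and the names `Pair`, `build_prompt`, `ioft_example`, and `cotft_example` are all assumptions introduced for illustration, and the teacher-generated trace is passed in as a plain string.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One parallel sentence pair (hypothetical container)."""
    source: str
    target: str
    src_lang: str
    tgt_lang: str


def build_prompt(pair: Pair) -> str:
    # Assumed prompt wording; the paper's actual template may differ.
    return (
        f"Translate the following {pair.src_lang} text into "
        f"{pair.tgt_lang}.\n{pair.source}"
    )


def ioft_example(pair: Pair) -> dict:
    # Input-output fine-tuning: the student learns to emit the target directly.
    return {"prompt": build_prompt(pair), "completion": pair.target}


def cotft_example(pair: Pair, trace: str) -> dict:
    # CoT fine-tuning: a teacher-written trace precedes the target.
    # The <think> tags are an assumed delimiter, not taken from the paper.
    completion = f"<think>\n{trace.strip()}\n</think>\n{pair.target}"
    return {"prompt": build_prompt(pair), "completion": completion}


if __name__ == "__main__":
    pair = Pair("Bonjour le monde.", "Hello world.", "French", "English")
    print(ioft_example(pair)["completion"])
    print(cotft_example(pair, "Draft: Hello world.")["completion"])
```

Both variants share the same prompt and the same final target, so any difference in downstream quality comes from the intermediate tokens. That matches the abstract's framing, in which the benefit depends on whether those tokens contain translation attempts.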

Paper Structure

This paper contains 36 sections, 15 figures, 8 tables.

Figures (15)

  • Figure 1: CoT Fine-Tuning (left): Given a source-target pair, a teacher is prompted to get a thought process on how to obtain the target given the source based on a given strategy (right). The obtained trace is used as intermediate information to fine-tune a student to "think" before translating.
  • Figure 2: Impact of temperature on translation quality with and without thinking tokens.
  • Figure 3: Comparison between input-output fine-tuning (IOFT) and CoT fine-tuning (CoTFT) with six different CoT templates. Across all figures, each unit on the x-axis represents 200 steps.
  • Figure 4: Comparison between IOFT and CoTFT with five different prompting strategies.
  • Figure 5: Comparison between IOFT and CoTFT with six different prompting strategies.
  • ...and 10 more figures