Lost in Literalism: How Supervised Training Shapes Translationese in LLMs

Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, Yue Zhang

TL;DR

This work studies translationese in LLM-based translation and attributes its persistence to biases introduced during supervised fine-tuning, despite broad pretraining on natural language. It introduces an evaluation framework that combines expert span annotations with a Translationese Span Ratio (TSR), and finds significant translationese in English-Chinese and German-English translations. The authors propose two training-aware mitigations: polishing golden references and filtering unnatural training instances. Experiments indicate substantial improvements in translation naturalness and quality across multiple languages, supported by automatic metrics (perplexity, lexical density, length variance, COMET-QE) and human judgments. The findings point to data quality and training procedures as key levers for producing fluent, target-language-consistent translations. The authors release data and code to support further research.
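
To make the span-based metric concrete, here is a minimal sketch of how a span ratio could be computed from expert annotations. The exact definition of TSR is not given on this page; the sketch assumes it is the fraction of characters in a translation covered by annotated translationese spans. The function name and the half-open character-offset format are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple

# Hypothetical annotation format: half-open character offsets [start, end).
Span = Tuple[int, int]


def translationese_span_ratio(text: str, spans: List[Span]) -> float:
    """Fraction of characters in `text` covered by annotated spans.

    ASSUMPTION: this character-coverage definition is illustrative and may
    differ from the paper's actual TSR definition.
    """
    if not text:
        return 0.0
    covered = [False] * len(text)
    for start, end in spans:
        # Clamp offsets to the text so malformed spans cannot raise.
        for i in range(max(0, start), min(len(text), end)):
            covered[i] = True
    # Overlapping spans are counted once because coverage is a boolean mask.
    return sum(covered) / len(text)


if __name__ == "__main__":
    translation = "abcdefghij"
    annotated = [(0, 3), (2, 5)]  # two overlapping annotated spans
    print(translationese_span_ratio(translation, annotated))
```

Using a boolean coverage mask rather than summing span lengths keeps overlapping annotations from inflating the ratio above 1.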

Abstract

Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at https://github.com/yafuly/LLM_Translationese.
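
The filtering mitigation can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes perplexity of the target-side reference is used as the naturalness proxy (the Figure 2 caption links TSR to perplexity, but this page does not confirm the filtering criterion), and it assumes per-token log-probabilities have already been obtained from some language model. The field names `target_logprobs` and `id` and the threshold value are hypothetical.

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def filter_unnatural(instances: List[Dict], max_ppl: float) -> List[Dict]:
    """Keep training instances whose target perplexity is at most `max_ppl`.

    ASSUMPTION: each instance carries precomputed log-probabilities for its
    target sentence under the hypothetical key "target_logprobs".
    """
    return [
        ex for ex in instances
        if perplexity(ex["target_logprobs"]) <= max_ppl
    ]


if __name__ == "__main__":
    data = [
        {"id": "a", "target_logprobs": [-1.0, -1.0]},  # lower perplexity
        {"id": "b", "target_logprobs": [-2.0, -2.0]},  # higher perplexity
    ]
    kept = filter_unnatural(data, max_ppl=3.0)
    print([ex["id"] for ex in kept])
```

In practice the threshold would have to be tuned per language pair, since perplexity scales differ across models and tokenizers.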

Paper Structure

This paper contains 30 sections, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Proportions of translations exhibiting translationese errors. All LLMs adopt direct translation prompts, with the exception of GPT-3.5 and GPT-4, which incorporate supplementary prompts to facilitate more natural translations. The "Specified" and "Polishing" prompts have identical requirements; however, the "Polishing" prompt specifically instructs LLMs to refine their generated translations.
  • Figure 2: Correlation between the human-annotated translationese span ratio (TSR) and LLM-generated perplexity.
  • Figure 3: Proportions of supervised training instances exhibiting different levels of translationese errors (TSR).
  • Figure 4: Comparison of naturalness between inference-time (Post-Polishing) and training-time polishing (Polished).
  • Figure 5: Translation naturalness and quality w.r.t. filtered training samples.
  • ...and 1 more figures