Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

Wafaa Mohammed, Vlad Niculae, Chrysoula Zerva

TL;DR

This paper addresses the challenge that large language models struggle with discourse phenomena in document-level translation. It introduces quality-aware decoding (QAD), an MBR-based inference strategy that leverages translation-quality signals to extract latent discourse knowledge from LLMs. Through comprehensive evaluation across six language pairs on DELA, TED2020, and WMT24++, the study shows that QAD consistently improves both overall translation quality and discourse handling, with context-aware prompting enhancing discourse phenomena further and TowerInstruct-13B often achieving the best results. The work contributes a rigorous evaluation framework, ablation studies of decoding setups, and human-annotated discourse data, demonstrating that QAD can make LLMs more suitable for real-world, discourse-sensitive translation tasks and offering practical guidance for deploying discourse-aware MT systems.

Abstract

Large language models (LLMs) have emerged as strong contenders in machine translation. Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate how well LLMs handle discourse phenomena in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.
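
The TL;DR describes QAD as an MBR-based inference strategy. As a rough illustration of minimum Bayes risk (MBR) selection in general, and not the authors' implementation, the sketch below picks the candidate translation that agrees most, on average, with the other sampled candidates. The names `overlap_f1` and `mbr_select` are hypothetical, and the token-overlap utility is only a toy stand-in for the learned translation-quality metric that a real quality-aware setup would presumably use.

```python
from collections import Counter
from typing import Callable, Sequence


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy utility (assumption): token-level F1 between two strings.

    A real quality-aware setup would plug in a learned quality metric here.
    """
    hyp_tokens, ref_tokens = hyp.lower().split(), ref.lower().split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts tokens shared by both strings.
    common = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(hyp_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = overlap_f1,
) -> str:
    """Return the candidate with the highest mean utility against the others.

    Each remaining candidate serves as a pseudo-reference, which is the usual
    sampling-based approximation of expected utility in MBR decoding.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / len(others)
        if score > best_score:
            best, best_score = hyp, score
    return best


# Hypothetical candidates sampled for one source sentence.
candidates = [
    "she gave him the book",
    "she gave him the book yesterday",
    "it gave it a book",
]
chosen = mbr_select(candidates)
```

In this toy run the first candidate is chosen because it overlaps most with the others. Swapping `utility` for a stronger quality estimator changes which properties of a translation the selection rewards, which is the lever a quality-aware strategy relies on.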

Paper Structure

This paper contains 33 sections, 2 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Difference between QAD and greedy LLM-as-a-judge scores on TED2020 data. The plot demonstrates that QAD improves the performance of LLMs.
  • Figure 2: Human-annotated accuracy of greedy and QAD outputs in handling discourse phenomena, averaged across all languages where the phenomena occur (number of languages is at the bottom). Arabic is excluded from this plot to avoid model-specific biases.
  • Figure 3: Semantic difference vs. preference summed over all languages (except Arabic).
  • Figure 4: Difference between QAD and greedy LLM-as-a-judge scores on WMT24++ data. The plot demonstrates that QAD improves the performance of LLMs.
  • Figure 5: Human-annotated accuracy of greedy and QAD outputs in handling discourse phenomena for Arabic data.
  • ...and 9 more figures