Escaping the sentence-level paradigm in machine translation

Matt Post, Marcin Junczys-Dowmunt

TL;DR

This work argues that translating at the document level, rather than the sentence level, resolves discourse phenomena more reliably and yields better overall translation. It shows that a high-capacity standard Transformer, trained on document-level samples drawn from back-translated monolingual data, can outperform sentence-level baselines, and that parallel data can even harm performance in this setting. The authors also show that traditional contrastive evaluation falls short of measuring a model's generative document-level ability, and they introduce generative variants of the existing tests to better assess discourse-aware translation. Across four large-data language pairs, the approach yields gains in document-level translation, while also highlighting the need for discourse-dense evaluation and larger model capacity to fully exploit context.

Abstract

It is well-known that document context is vital for resolving a range of translation ambiguities, and in fact the document setting is the most natural setting for nearly all translation. It is therefore unfortunate that machine translation -- both research and production -- largely remains stuck in a decades-old sentence-level translation paradigm. It is also an increasingly glaring problem in light of competitive pressure from large language models, which are natively document-based. Much work in document-context machine translation exists, but for various reasons has been unable to catch hold. This paper suggests a path out of this rut by addressing three impediments at once: what architectures should we use? where do we get document-level information for training them? and how do we know whether they are any good? In contrast to work on specialized architectures, we show that the standard Transformer architecture is sufficient, provided it has enough capacity. Next, we address the training data issue by taking document samples from back-translated data only, where the data is not only more readily available, but is also of higher quality compared to parallel document data, which may contain machine translation output. Finally, we propose generative variants of existing contrastive metrics that are better able to discriminate among document systems. Results in four large-data language pairs (DE$\rightarrow$EN, EN$\rightarrow$DE, EN$\rightarrow$FR, and EN$\rightarrow$RU) establish the success of these three pieces together in improving document-level performance.

Paper Structure

This paper contains 35 sections, 3 figures, and 16 tables.

Figures (3)

  • Figure 1: Escaping the rut of sentence-level translation: (1) source documents from trustworthy data only, (2) feed them into large-capacity standard Transformer models, and (3) use test sets that evaluate a model's generative ability.
  • Figure 2: GenPro accuracies for EN-DE, reporting across all pronouns with extra-sentential anaphora.
  • Figure 3: Token context. GenPro accuracies for EN-DE as a function of the total token budget (columns), including the payload, and the number of those tokens allocated to the left context (rows). Only complete sentences are added to the context. Leftover tokens are allocated to the right context.
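
The context-allocation scheme described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names `count_tokens` and `select_context` are hypothetical, whitespace splitting stands in for whatever subword tokenizer the paper uses, and the choice to fill the left context starting from the sentence nearest the payload is an assumption.

```python
from typing import List, Tuple


def count_tokens(sentence: str) -> int:
    """Whitespace token count; an assumed stand-in for a real subword tokenizer."""
    return len(sentence.split())


def select_context(
    sentences: List[str],
    payload_idx: int,
    total_budget: int,
    left_budget: int,
) -> Tuple[List[str], str, List[str]]:
    """Pick left and right context sentences around a payload sentence.

    total_budget covers the payload plus all context tokens.
    left_budget is the share of that budget reserved for the left context.
    Only complete sentences are added; unused tokens go to the right context.
    """
    payload = sentences[payload_idx]
    remaining = total_budget - count_tokens(payload)
    if remaining < 0:
        raise ValueError("payload alone exceeds the total token budget")

    # Left context: walk backwards from the payload, whole sentences only
    # (assumption: the nearest preceding sentence is added first).
    left: List[str] = []
    left_cap = min(left_budget, remaining)
    used_left = 0
    for sent in reversed(sentences[:payload_idx]):
        n = count_tokens(sent)
        if used_left + n > left_cap:
            break
        left.insert(0, sent)
        used_left += n

    # Right context: everything the payload and left context did not use.
    right: List[str] = []
    right_cap = remaining - used_left
    used_right = 0
    for sent in sentences[payload_idx + 1:]:
        n = count_tokens(sent)
        if used_right + n > right_cap:
            break
        right.append(sent)
        used_right += n

    return left, payload, right
```

For example, with a 10-token total budget, a 4-token payload, and 3 tokens reserved for the left context, a 2-token preceding sentence fits on the left and the 4 tokens still unused are then available for the right context.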