Recovering document annotations for sentence-level bitext

Rachel Wicks, Matt Post, Philipp Koehn

TL;DR

This work tackles the scarcity of document-level training data for machine translation by reconstructing document-level annotations for three major parallel corpora (ParaCrawl, News Commentary, Europarl) across six languages and introducing a document-level filtering method. It trains context-aware Transformer MT models on the resulting ParaDocs data, combining document-context streams with supplementary sentence-level data, and demonstrates improvements in both general translation quality and the handling of context-dependent phenomena. The authors release ParaDocs and the trained models as resources to advance document-aware MT research. Overall, the approach shows that preserving and exploiting document structure can boost document-level translation without harming sentence-level performance, offering practical benefits for real-world MT applications.

Abstract

Data availability limits the scope of any given task. In machine translation, earlier models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and lack the data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three large datasets (ParaCrawl, News Commentary, and Europarl) in German, French, Spanish, Italian, Polish, and Portuguese (each paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We accompany this filtering with an analysis showing that the method prefers context-consistent translations over those that may have been machine translated at the sentence level. Finally, we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, ParaDocs, and the resulting models as a resource to the community.

Paper Structure (28 sections, 1 figure, 11 tables)

Figures (1)

  • Figure 1: An example from ParaCrawl. The existing bitext has no contextual information. A model is trained to produce "Sie" (a feminine pronoun) from "It" without appropriate context. We restore this information by finding text in the corresponding monolingual dumps, and add document, paragraph, sentence, and character offset metadata.
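
The recovery step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes exact string matching against newline-delimited documents, which is a simplification chosen here for illustration and not necessarily the authors' matching procedure. The names `Location`, `locate`, and the `doc-0` identifier are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Location:
    doc_id: str      # hypothetical identifier of the monolingual document
    paragraph: int   # 0-based index of the newline-delimited paragraph
    char_start: int  # offset of the sentence within the whole document
    char_end: int    # exclusive end offset


def locate(sentence: str, docs: Dict[str, str]) -> Optional[Location]:
    """Return the first exact occurrence of `sentence` in any document.

    Assumption: the bitext sentence appears verbatim in the dump.
    """
    for doc_id, text in docs.items():
        start = text.find(sentence)
        if start == -1:
            continue
        # The paragraph index is the number of newlines before the match.
        paragraph = text.count("\n", 0, start)
        return Location(doc_id, paragraph, start, start + len(sentence))
    return None


# Toy document: two paragraphs separated by a newline.
docs = {"doc-0": "The cat sat.\nIt was tired. It slept."}
loc = locate("It was tired.", docs)
```

Once a sentence is located, its neighbours in the same document can be read back as context, which is the information the caption says is missing from the existing bitext. Real dumps would likely need normalization or fuzzy matching before lookup, which this sketch omits.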