DocHPLT: A Massively Multilingual Document-Level Translation Dataset

Dayyán O'Brien, Bhavitvya Malik, Ona de Gibert, Pinzhen Chen, Barry Haddow, Jörg Tiedemann

TL;DR

DocHPLT is the largest publicly available multilingual document-level translation dataset to date, covering 50 languages paired with English and 4.26B sentences, and is intended to support long-context modeling. The authors use a document-first collection pipeline that preserves complete document structure, and pivoting through English yields additional pairs that do not involve English. In their experiments, fine-tuning LLMs on DocHPLT improves over prompting baselines, with particularly large gains for low-resource languages; among the context strategies compared, 10-sentence chunks often outperform full document-to-document training. Multilingual fine-tuning yields mixed results: it helps some languages but offers limited zero-shot transfer to unseen languages. The dataset is released as open source to support research on multilingual document-level translation and long-context NLP.

Abstract

Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences. By adding pivoted alignments, practitioners can obtain 2500 additional pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content, including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
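
To illustrate the pivoted alignments mentioned in the abstract, here is a minimal sketch of deriving non-English document pairs from English-centric ones. The record layout (`english_doc_id`, `lang`, `doc_id`) and the function name `pivot_pairs` are hypothetical and chosen only for illustration; they are not the dataset's actual schema or tooling.

```python
from collections import defaultdict
from itertools import combinations


def pivot_pairs(aligned):
    """Derive non-English document pairs that share an English pivot.

    `aligned` is an iterable of (english_doc_id, lang, doc_id) tuples.
    This layout is an assumption for illustration, not the DocHPLT format.
    """
    # Group every non-English document under the English document it aligns to.
    by_english = defaultdict(list)
    for english_doc_id, lang, doc_id in aligned:
        by_english[english_doc_id].append((lang, doc_id))

    pivoted = []
    for english_doc_id in sorted(by_english):
        docs = sorted(by_english[english_doc_id])
        # Any two documents in different languages aligned to the same
        # English document form a candidate pivoted pair.
        for (lang_a, doc_a), (lang_b, doc_b) in combinations(docs, 2):
            if lang_a != lang_b:
                pivoted.append(((lang_a, doc_a), (lang_b, doc_b)))
    return pivoted


example = [
    ("en1", "ca", "ca1"),
    ("en1", "fi", "fi1"),
    ("en2", "ca", "ca2"),
]
pairs = pivot_pairs(example)
```

In this toy input only `en1` has documents in two different languages, so a single Catalan-Finnish pair is produced. Pairs obtained this way inherit any alignment noise from both English-centric alignments, so additional filtering may be needed in practice.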

Paper Structure

This paper contains 37 sections, 1 equation, 1 figure, 9 tables.

Figures (1)

  • Figure 1: An example of good, bad (in blue), and multi-way alignments for English-Catalan docs.