Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data

Mara Finkelstein; David Vilar; Markus Freitag

TL;DR

This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. Training from scratch on this dataset outperforms training on the WMT'23 training dataset, and also outperforms training on the top-quality subset of the WMT'23 training dataset.

Abstract

Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT'23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT'23 training dataset. We also find that performing self-distillation by finetuning the LLM that generated this dataset outperforms the LLM's strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving the performance of NMT models.
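For readers unfamiliar with the two selection strategies named in the abstract, the following is a minimal sketch of minimum Bayes risk (MBR) decoding and quality estimation (QE) reranking over a list of candidate translations. It is not the authors' implementation: the function names are hypothetical, and `token_overlap` is a toy stand-in for the learned utility metric a real system would use. MBR picks the candidate with the highest average utility against all candidates treated as pseudo-references, while QE reranking picks the candidate with the highest reference-free score given the source.

```python
from typing import Callable, Sequence


def token_overlap(hyp: str, ref: str) -> float:
    """Toy utility (assumption): Jaccard overlap of whitespace token sets.

    A real system would plug in a learned reference-based metric here.
    """
    a, b = set(hyp.split()), set(ref.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mbr_decode(
    candidates: Sequence[str],
    utility: Callable[[str, str], float] = token_overlap,
) -> str:
    """Return the candidate with the highest mean utility, using every
    candidate in the list as a pseudo-reference."""
    def expected_utility(hyp: str) -> float:
        return sum(utility(hyp, ref) for ref in candidates) / len(candidates)

    return max(candidates, key=expected_utility)


def qe_rerank(
    source: str,
    candidates: Sequence[str],
    qe_score: Callable[[str, str], float],
) -> str:
    """Return the candidate with the highest reference-free QE score.

    `qe_score(source, hypothesis)` is a hypothetical scorer supplied by the caller.
    """
    return max(candidates, key=lambda hyp: qe_score(source, hyp))


if __name__ == "__main__":
    cands = ["das ist ein Test", "das ist ein Test .", "dies ist Test"]
    print(mbr_decode(cands))
    # Toy QE scorer (assumption): prefer longer hypotheses.
    print(qe_rerank("this is a test", cands, lambda src, hyp: len(hyp)))
```

Note the cost difference this sketch makes visible: MBR needs a number of utility calls quadratic in the number of candidates, whereas QE reranking needs one scorer call per candidate.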

Paper Structure (26 sections, 2 equations, 4 figures, 16 tables)

Introduction
NewsPaLM Dataset
Source-side Data Collection: Newscrawl
Construction of "Blobs"
Cluster-Based Text Selection
MBR Decoding and QE Reranking
Candidate List Generation
MBR and QE scoring
Dataset Statistics
Experimental Setup
Datasets
Training Data
Development and Test Sets
Models
Evaluation
...and 11 more sections

Figures (4)

Figure 1: Distribution of English-German MBR sentence-level versus QE blob-level target lengths (computed using the Moses tokenizer).
Figure 2: Comparison of pretraining performance on NewsPaLM MBR sentence-level dataset versus NewsPaLM QE blob-level dataset, bucketed by source length (WMT'23 en→de test set). Note that performance of the model trained on the blob-level data is stable across segment lengths, while performance of the model trained on the sentence-level data declines as segment length increases (according to both MetricX and Comet22 metrics).
Figure 3: Comparison of model performance when pretraining and finetuning on the full versus subsampled NewsPaLM MBR dataset (WMT'23 test set). The subsampled dataset is 25% of the size of the full dataset, and was sampled randomly. Note that pretraining performance drops substantially when training on the subsampled dataset (for both en→de and de→en), while finetuning performance is minimally affected.
Figure 4: PaLM-2 Bison few-shot versus NewsPaLM MBR-finetuned performance bucketed by source length (en→de WMT'23 test set). Note that self-MBR finetuning (on sentence-level data only) improves performance across all source length buckets.
