Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data

Mara Finkelstein, David Vilar, Markus Freitag

TL;DR

This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. Training from scratch on this dataset outperforms training on the WMT'23 training dataset, and also outperforms training on its top-quality subset.

Abstract

Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT'23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT'23 training dataset. We also find that performing self-distillation by finetuning the LLM that generated this dataset outperforms the LLM's strong few-shot baseline. These findings corroborate the quality of our dataset and demonstrate the value of high-quality machine-generated data in improving the performance of NMT models.
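
For readers unfamiliar with the two selection strategies named above, the following is a minimal sketch of how MBR decoding and QE reranking each pick one translation from a list of sampled candidates. The function names (`mbr_select`, `qe_select`) and the toy scoring functions (`token_overlap`, `toy_qe`) are hypothetical stand-ins chosen for illustration; they are not the metrics or the pipeline used to build the dataset, where learned neural metrics would normally fill these roles.

```python
from typing import Callable, Sequence


def mbr_select(candidates: Sequence[str],
               utility: Callable[[str, str], float]) -> str:
    """MBR decoding: pick the candidate with the highest average utility
    when every other candidate is treated as a pseudo-reference."""
    best, best_score = candidates[0], float("-inf")
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = hyp, score
    return best


def qe_select(source: str,
              candidates: Sequence[str],
              qe_score: Callable[[str, str], float]) -> str:
    """QE reranking: pick the candidate with the highest reference-free
    quality estimate given only the source."""
    return max(candidates, key=lambda hyp: qe_score(source, hyp))


def token_overlap(hyp: str, ref: str) -> float:
    """Toy utility (assumption): Jaccard overlap of whitespace tokens."""
    h, r = set(hyp.split()), set(ref.split())
    union = h | r
    return len(h & r) / len(union) if union else 0.0


def toy_qe(source: str, hyp: str) -> float:
    """Toy QE score (assumption): penalize source/target length mismatch."""
    return -abs(len(source.split()) - len(hyp.split()))


if __name__ == "__main__":
    source = "die Katze sass auf der Matte"
    candidates = [
        "the cat sat on the mat",
        "the cat sat on a mat",
        "a dog ran",
    ]
    print("MBR pick:", mbr_select(candidates, token_overlap))
    print("QE pick: ", qe_select(source, candidates, toy_qe))
```

Note the cost difference between the two: MBR compares every candidate against every other candidate (quadratic in the number of candidates), whereas QE reranking scores each candidate once against the source (linear).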

Paper Structure (26 sections, 2 equations, 4 figures, 16 tables)

Figures (4)

  • Figure 1: Distribution of English-German MBR sentence-level versus QE blob-level target lengths (computed using the Moses tokenizer).
  • Figure 2: Comparison of pretraining performance on NewsPaLM MBR sentence-level dataset versus NewsPaLM QE blob-level dataset, bucketed by source length (WMT'23 en$\rightarrow$de test set). Note that performance of the model trained on the blob-level data is stable across segment lengths, while performance of the model trained on the sentence-level data declines as segment length increases (according to both MetricX and Comet22 metrics).
  • Figure 3: Comparison of model performance when pretraining and finetuning on the full versus subsampled NewsPaLM MBR dataset (WMT'23 test set). The subsampled dataset is 25% of the size of the full dataset, and was sampled randomly. Note that pretraining performance drops substantially when training on the subsampled dataset (for both en$\rightarrow$de and de$\rightarrow$en), while finetuning performance is minimally affected.
  • Figure 4: PaLM-2 Bison few-shot versus NewsPaLM MBR-finetuned performance bucketed by source length (en$\rightarrow$de WMT'23 test set). Note that self-MBR finetuning (on sentence-level data only) improves performance across all source length buckets.