HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models

Stephan Oepen, Nikolay Arefev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, Lucas Charpentier, Pinzhen Chen, Mariya Fedorova, Ona de Gibert, Barry Haddow, Jan Hajič, Jindřich Helcl, Andrey Kutuzov, Veronika Laippala, Zihao Li, Risto Luukkonen, Bhavitvya Malik, Vladislav Mikhailov, Amanda Myntti, Dayyán O'Brien, Lucie Poláková, Sampo Pyysalo, Gema Ramírez Sánchez, Janine Siewert, Pavel Stepachev, Jörg Tiedemann, Teemu Vahtola, Dušan Variš, Fedor Vitiugin, Tea Vojtěchová, Jaume Zaragoza

TL;DR

HPLT 3.0 addresses the lack of open, large-scale multilingual data for LLMs and MT by providing an open, ~30 trillion-token corpus across ~200 languages, built on an enhanced, scalable processing pipeline. It couples mono- and bilingual resources with a novel multilingual evaluation framework (HPLT-E) and a set of 57 monolingual encoder–decoder models, plus parallel data and synthetic MT-generated data to broaden language coverage. Empirical results show that HPLT 3.0 enables competitive multilingual performance, with data quality improvements yielding gains over prior resources and promising results from synthetic data generation. The work emphasizes transparency, reproducibility, and community collaboration to democratize multilingual NLP research and development.

Abstract

We present an ongoing initiative to provide open, very large, high-quality, and richly annotated textual datasets for almost 200 languages. At 30 trillion tokens, this is likely the largest generally available multilingual collection of LLM pre-training data. These datasets are derived from web crawls from different sources and are accompanied by a complete, open-source pipeline for document selection from web archives; text extraction from HTML; language identification for noisy texts; exact and near-deduplication; annotation with, among others, register labels, text quality estimates, and personally identifiable information; and final selection and filtering. We report on data quality probes through contrastive and analytical statistics, through manual inspection of samples for 24 languages, and through end-to-end evaluation of various language model architectures trained on this data. For multilingual LLM evaluation, we provide a comprehensive collection of benchmarks for nine European languages, with special emphasis on natively created tasks, mechanisms to mitigate prompt sensitivity, and refined normalization and aggregation of scores. Additionally, we train and evaluate a family of 57 monolingual encoder-decoder models, as well as a handful of monolingual GPT-like reference models. Besides the monolingual data and models, we also present a very large collection of parallel texts automatically mined from this data, together with a novel parallel corpus synthesized via machine translation.
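To make the "exact and near-deduplication" stage of the pipeline concrete, here is a minimal sketch in Python. It is not the HPLT implementation: the function names, the 3-word shingle size, and the 0.8 threshold are all illustrative assumptions, and the naive pairwise Jaccard comparison stands in for whatever scalable method the actual pipeline uses.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def shingles(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of the normalized text."""
    words = normalize(text).split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold: float = 0.8, n: int = 3):
    """Keep the first occurrence of each document.

    Exact duplicates are detected by hashing the normalized text.
    Near-duplicates are detected by Jaccard similarity over word shingles.
    Both the threshold and the shingle size are assumed values.
    """
    seen_hashes = set()
    kept = []  # list of (document, shingle set) pairs
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        seen_hashes.add(digest)
        doc_shingles = shingles(doc, n)
        if any(jaccard(doc_shingles, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document already kept
        kept.append((doc, doc_shingles))
    return [doc for doc, _ in kept]
```

The pairwise comparison is quadratic in the number of kept documents, so it only illustrates the idea on small samples. At web scale, an approximate scheme such as MinHash with locality-sensitive hashing is the usual choice.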

Paper Structure

This paper contains 36 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Schematic overview of data preparation.
  • Figure 2: Comparison of models pretrained on FineWeb, HPLT 2.0, 3.0, and MADLAD-400.
  • Figure 3: Comparison of different WDS-based sampling strategies from the Spanish HPLT 3.0 data.