An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu
TL;DR
HPLT v2 expands multilingual data resources for high-performance language technologies by releasing monolingual corpora covering 193 languages and English-centric parallel corpora pairing English with 50 other languages, plus the DocHPLT v2 and MultiHPLT v2 extensions. The paper details an end-to-end, open-source data pipeline that uses Trafilatura for text extraction, OpenLID for language identification, MinHash for deduplication, and Bitextor for parallel-corpus extraction. It also reports extensive quality analyses (manual inspection, domain and register labeling) and empirical evaluations (MLMs, NLU, MT). Results show improved data quality and stronger performance on linguistic tasks and translation benchmarks compared with prior releases, highlighting the value of large-scale, openly available multilingual corpora for model pretraining and evaluation. Limitations include language and domain imbalances that favor Indo-European languages; the authors plan to expand coverage of under-served languages and to add further document-level parallel resources. The release also documents ethical and environmental considerations.
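To make the MinHash deduplication step mentioned above more concrete, the following is a minimal, standard-library sketch of near-duplicate detection with MinHash signatures. It is not the HPLT pipeline code: the function names, the word 3-gram shingles, the 128 hash functions, and the 0.8 similarity threshold are all illustrative assumptions, and the pairwise comparison is quadratic, whereas a production pipeline would typically use an index such as locality-sensitive hashing.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def seeded_hash(gram, seed):
    """64-bit hash of a shingle; the seed selects one of the hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=128, n=3):
    """Signature = minimum hash value over all shingles, for each hash function.

    num_perm and n are illustrative values, not the paper's settings.
    Returns None for texts that contain no words.
    """
    grams = shingles(text, n)
    if not grams:
        return None
    return [min(seeded_hash(g, seed) for g in grams) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep the first document of each near-duplicate group.

    Documents without any words are skipped. The threshold is an assumption.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if sig is None:
            continue
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

Calling `deduplicate` on a list of document strings returns the list with later near-duplicates removed. Raising the threshold makes the filter more conservative, and increasing `num_perm` reduces the variance of the similarity estimate at the cost of more hashing.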
Abstract
Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora, extending prior work of the HPLT project. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
