The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Hugo Laurençon; Lucile Saulnier; Thomas Wang; Christopher Akiki; Albert Villanova del Moral; Teven Le Scao; Leandro Von Werra; Chenghao Mou; Eduardo González Ponferrada; Huu Nguyen; Jörg Frohberg; Mario Šaško; Quentin Lhoest; Angelina McMillan-Major; Gerard Dupont; Stella Biderman; Anna Rogers; Loubna Ben allal; Francesco De Toni; Giada Pistilli; Olivier Nguyen; Somaieh Nikpoor; Maraim Masoud; Pierre Colombo; Javier de la Rosa; Paulo Villegas; Tristan Thrush; Shayne Longpre; Sebastian Nagel; Leon Weber; Manuel Muñoz; Jian Zhu; Daniel Van Strien; Zaid Alyafeai; Khalid Almubarak; Minh Chien Vu; Itziar Gonzalez-Dios; Aitor Soroa; Kyle Lo; Manan Dey; Pedro Ortiz Suarez; Aaron Gokaslan; Shamik Bose; David Adelani; Long Phan; Hieu Tran; Ian Yu; Suhas Pai; Jenny Chim; Violette Lepercq; Suzana Ilic; Margaret Mitchell; Sasha Alexandra Luccioni; Yacine Jernite

TL;DR

This paper describes the construction of ROOTS, a 1.6TB multilingual corpus assembled through a values-driven, open collaboration under the BigScience project to train BLOOM. It details a two-part data sourcing approach that combines community-identified resources (including pseudo-crawled web data) with data from the OSCAR web-crawl corpus, and a processing pipeline that covers cleaning, filtering, deduplication, and PII mitigation. The work emphasizes transparency, governance, and tooling: it releases the data-processing tools and a subset of ROOTS while addressing ethical and legal considerations. Overall, ROOTS shows how large-scale multilingual corpora can be built responsibly with community involvement, thorough documentation, and reproducible pipelines to support multilingual NLP research.

Abstract

As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.

Paper Structure (41 sections, 12 figures, 3 tables, 1 algorithm)

This paper contains 41 sections, 12 figures, 3 tables, 1 algorithm.

Introduction
Outline of the Paper
Related Work
Large Language Models and Large Text Corpora
Tooling, Visualization, and Replication
Documenting Textual Corpora in NLP
(Crowd) Sourcing a Language Resource Catalogue
Obtaining Data from the Identified Resources
Gathering Identified Datasets and Collections
Pseudo-Crawled Data
GitHub Code
Merging and Deduplicating Sources
Processing Pipeline for Quality Improvement on Crowdsourced Datasets
Processing OSCAR
Data cleaning and filtering
...and 26 more sections

Figures (12)

Figure 1: Overview of ROOTS. Left: A treemap of natural language representation in number of bytes by language family. The bulk of the graph is overwhelmed by the 1321.89 GB allotted to Eurasia. The orange rectangle corresponds to the 18GB of Indonesian, the sole representative of the Papunesia macroarea, and the green rectangle to the 0.4GB of the Africa linguistic macroarea. Right: A waffle plot of the distribution of programming languages by number of files. One square corresponds approximately to 30,000 files.
Figure 2: Partial screenshot of the visualization tool. Users can look at how each function in the processing pipeline influenced high-level statistics. Influence on specific samples can be monitored via the same tool; see the appendix on the visualization tool.
Figure 3: Percentage of documents discarded by each filter independently for 5 languages
Figure 4: A raw size comparison to other corpora used to train large language models. The asterisk next to GPT-3 indicates that the value is an estimate, computed from the reported number of tokens and the average number of tokens per byte of text that the GPT-2 tokenizer produces on the Pile-CC, Books3, OWT2, and Wiki-en subsets of the Pile (Gao et al., 2020).
Figure 5: Size in bytes of every document in the corpus per language. The y-axis is in logarithmic scale. Box-and-whisker diagrams show the median, the first and third quartiles, whiskers drawn within 1.5 times the interquartile range (IQR), and outliers.
...and 7 more figures
