Data, Data Everywhere: A Guide for Pretraining Dataset Construction

Jupinder Parmar; Shrimai Prabhumoye; Joseph Jennings; Bo Liu; Aastha Jhunjhunwala; Zhilin Wang; Mostofa Patwary; Mohammad Shoeybi; Bryan Catanzaro

TL;DR

The first systematic study across the entire pipeline of pretraining set construction identifies which methods translate to the largest gains in model accuracy on downstream evaluations and shows how such attribute information can be used to further refine and improve the quality of a pretraining set.

Abstract

The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire pipeline of pretraining set construction. First, we run ablations on existing techniques for pretraining set development to identify which methods translate to the largest gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high-quality pretraining sets.

Paper Structure (62 sections, 14 figures, 24 tables)

Introduction
Experimental Setup
Data Sources
Evaluation
Model Specifications
Data Curation
Methodology
Ablations
Data Selection
Methodology
Ablations
Data Sampling
Methodology
Ablations
English
...and 47 more sections

Figures (14)

Figure 1: Each step in the development process to go from a collection of data sources into a final pretraining set that produces a highly capable LM.
Figure 2: Distribution of document types in web crawl.
Figure 3: Distribution of content domains in web crawl.
Figure 4: Domains sorted by descending order of percentage of high quality documents.
Figure 5: Heatmap of domains by probability of toxic content. Adult and online communities contain the highest percentage of toxic content.
...and 9 more figures
