Zyda: A 1.3T Dataset for Open Language Modeling

Yury Tokpanov; Beren Millidge; Paolo Glorioso; Jonathan Pilault; Adam Ibrahim; James Whittington; Quentin Anthony

TL;DR

Zyda addresses the need for large-scale, open, high-quality pretraining data by unifying major permissively licensed datasets and applying rigorous two-stage filtering plus cross-dataset deduplication. The resulting 1.3T-token corpus improves language-modeling performance over open baselines, with the gains largely attributable to the quality improvements from the processing pipeline and deduplication. In equal-token comparisons, Zyda's advantage grows with model scale, and ablations indicate that removing code-heavy components (e.g., StarCoder) can sharpen language-focused performance for smaller models. The work presents a practical, open resource for large-scale pretraining and suggests that combining Zyda with additional datasets such as FineWeb is one avenue toward frontier capabilities.

Abstract

The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.
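To make "deduplication, both within and across datasets" concrete, here is a minimal sketch of near-duplicate removal over a pooled collection of documents. It is an illustration under stated assumptions, not the pipeline used to build Zyda: it compares word 3-gram sets with exact Jaccard similarity (a quadratic-time stand-in for the hashing-based methods used at scale), and the function names, the `n=3` shingle size, and the `0.8` threshold are all hypothetical choices.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.8):
    """Keep the first document of every near-duplicate group.

    docs is a list of (source, text) pairs pooled from several datasets,
    so duplicates are caught both within a source and across sources.
    The threshold is an assumed value chosen for illustration only.
    """
    kept, kept_shingles = [], []
    for source, text in docs:
        current = shingles(text)
        if any(jaccard(current, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of a document we already kept
        kept.append((source, text))
        kept_shingles.append(current)
    return kept


docs = [
    ("RefinedWeb", "the quick brown fox jumps over the lazy dog"),
    ("SlimPajama", "the quick brown fox jumps over the lazy dog"),
    ("StarCoder", "def add(a, b): return a + b"),
]
unique_docs = deduplicate(docs)
```

Because every document is compared against all documents kept so far, the copy from the second source is dropped even though it comes from a different dataset. This pairwise comparison does not scale to trillions of tokens; it only shows the keep-first logic.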

Paper Structure (17 sections, 1 equation, 8 figures, 9 tables)

Introduction
Dataset Composition and processing
Composition
Filtering
Deduplication
Performance
Related Work
Discussion
Limitations
Ablation experimental details
Societal impacts
Ablation performance by evals
Additional details for dataset processing
Additional Filtering Details
Number of documents removed by each filter per dataset
...and 2 more sections

Figures (8)

Figure 1: The proportion of different dataset subsets in Zyda. The largest component is RefinedWeb, followed by SlimPajama and StarCoder.
Figure 2: Document similarity distances for the Zyda dataset.
Figure 3: Aggregate evaluation scores across training steps for a 1.4B model trained on 50B tokens of Zyda and comparable datasets. Zyda, and especially Zyda without StarCoder, outperforms strong open baselines such as FineWeb and Dolma. The aggregate score is the mean of arc-challenge, arc-easy, boolq, openbookqa, piqa, sciq, and winogrande. Scores are smoothed using a window size of 5.
Figure 4: We match the Pythia suite in architecture and training hyperparameters. We observe that Zyda outperforms Pile on evaluations and that this advantage increases with scale, which we believe to be due to reduced noise on standard evals as model performance improves. All models were trained for 300B tokens on either Pile or Zyda.
Figure 5: Comparison of Zyda with alternative datasets, and across deduplication LSH
...and 3 more figures
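The aggregate score described in the Figure 3 caption (an unweighted mean over seven tasks, smoothed with a window of 5) can be sketched as follows. The caption does not say which kind of window is used, so the trailing moving average here is an assumption, and the function names are hypothetical.

```python
TASKS = [
    "arc-challenge", "arc-easy", "boolq", "openbookqa",
    "piqa", "sciq", "winogrande",
]


def aggregate_score(task_scores):
    """Unweighted mean of the seven task scores at one checkpoint."""
    return sum(task_scores[task] for task in TASKS) / len(TASKS)


def smooth(values, window=5):
    """Trailing moving average (assumed form of the smoothing).

    Early points average over however many values are available so far.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Placeholder inputs to show the call pattern; not results from the paper.
checkpoint = {task: 0.5 for task in TASKS}
score = aggregate_score(checkpoint)
curve = smooth([1, 2, 3, 4, 5, 6], window=5)
```

A trailing window keeps each smoothed point dependent only on earlier checkpoints; a centered window would be an equally plausible reading of the caption.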
