Zyda-2: a 5 Trillion Token High-Quality Dataset

Yury Tokpanov, Paolo Glorioso, Quentin Anthony, Beren Millidge

TL;DR

Zyda-2 tackles the challenge of building a high-quality, open-source pretraining corpus at massive scale by combining several strong source datasets and passing them through a two-stage pipeline: cross-deduplication followed by model-based quality filtering. The authors show that this approach yields about 5 trillion tokens and supports the Zamba2 series of models, which are state-of-the-art for their weight class, while outperforming contemporary open datasets in annealing experiments on Zamba2-2.7B. Through targeted weighting experiments, they show that increasing the share of FineWeb-Edu improves results, while retaining the diversity contributed by the smaller datasets remains beneficial. The work offers practical guidance on open-source data curation, highlights the nuanced role of duplicates, and points to future directions in data filtering and synthetic augmentation for small-to-mid-size language models.

Abstract

In this technical report, we present Zyda-2: a five trillion token dataset for language model pretraining. Zyda-2 was used to train our Zamba2 series of models, which are state-of-the-art for their weight class. We build Zyda-2 by collating high-quality open-source tokens from datasets such as FineWeb and DCLM, then distilling them to the highest-quality subset via cross-deduplication and model-based quality filtering. Zyda-2 is released under a permissive open license and is available at https://huggingface.co/datasets/Zyphra/Zyda-2
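
The two stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the source does not say which deduplication algorithm or classifier was used, so this example assumes exact matching on a normalized-text hash and uses a hypothetical word-count heuristic (`toy_label`) as a stand-in for a model-based classifier. All function names and the sample documents are illustrative. Following the figure captions, filtering is applied only to datasets marked as unfiltered, and only documents labeled 'high-quality' are kept.

```python
import hashlib
from typing import Callable, Dict, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return " ".join(text.lower().split())


def cross_deduplicate(datasets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the first copy of each document across all datasets.

    Assumption: exact matching on a hash of the normalized text.
    Datasets listed earlier in the dict take priority.
    """
    seen: Set[str] = set()
    result: Dict[str, List[str]] = {}
    for name, docs in datasets.items():
        kept = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        result[name] = kept
    return result


def quality_filter(docs: List[str], label_fn: Callable[[str], str]) -> List[str]:
    """Keep only documents that the classifier labels 'high-quality'."""
    return [doc for doc in docs if label_fn(doc) == "high-quality"]


def toy_label(doc: str) -> str:
    """Hypothetical stand-in for a model-based classifier (word-count heuristic)."""
    return "high-quality" if len(doc.split()) >= 5 else "low-quality"


def build_corpus(
    datasets: Dict[str, List[str]],
    unfiltered_names: Set[str],
    label_fn: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Stage 1: cross-deduplicate. Stage 2: filter only the unfiltered datasets."""
    deduped = cross_deduplicate(datasets)
    return {
        name: quality_filter(docs, label_fn) if name in unfiltered_names else docs
        for name, docs in deduped.items()
    }


# Illustrative sample data, not taken from any real dataset.
sample = {
    "dclm": ["The quick brown fox jumps over the lazy dog.", "Short text."],
    "zyda-1": [
        "the quick  brown fox jumps over the lazy dog.",
        "Another long document about data curation pipelines.",
        "Too short.",
    ],
}
corpus = build_corpus(sample, {"zyda-1"}, toy_label)
print(corpus)
```

In this toy run, the first "zyda-1" document is dropped as a cross-dataset duplicate of a "dclm" document, and the last one is dropped by the stand-in classifier. At real scale, exact hashing would typically be replaced by an approximate near-duplicate method, and the heuristic by a trained classifier.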

Paper Structure

This paper contains 11 sections, 10 figures, 1 table.

Figures (10)

  • Figure 1: Dataset creation process for Zyda-2. We first collated the best open-source datasets available, then ran cross-deduplication between them, since they all ultimately derive mostly from a common source (Common Crawl). Finally, we applied model-based quality filtering to the two unfiltered datasets (Zyda-1 and Dolma-CC).
  • Figure 2: The performance of a 1.4B model trained on 50B tokens with and without model-based filtering on the Zyda-1 and Dolma-CC datasets. The aggregate evaluation score is the mean across the following standard language modeling benchmarks: HellaSwag, PIQA, OpenBookQA, ARC-Challenge, ARC-Easy, and WinoGrande. For quality filtering, we kept only those documents labeled as 'high-quality' by the model-based classifier.
  • Figure 3: Composition of Zyda-2
  • Figure 4: Performance of Zyda-2 vs. other datasets, measured as an aggregate weighted evaluation score. This score is an average of MMLU, HellaSwag, PIQA, OpenBookQA, ARC-Challenge, ARC-Easy, and WinoGrande. The scores were collected by annealing the base version of Zamba2-2.7B for roughly 40B tokens on each dataset.
  • Figure 5: The proportion of each of the component datasets comprising Zyda-2 using the optimal weighting. FineWeb-Edu and DCLM each account for approximately the same total proportion of the dataset.
  • ...and 5 more figures