Zyda-2: a 5 Trillion Token High-Quality Dataset

Yury Tokpanov; Paolo Glorioso; Quentin Anthony; Beren Millidge

TL;DR

Zyda-2 tackles the challenge of building a high-quality, open-source pretraining corpus at massive scale by combining multiple strong sources with a two-stage pipeline of cross-deduplication and model-based filtering. The authors demonstrate that this approach yields about 5 trillion tokens and enables state-of-the-art performance for the Zamba2 series of models in their weight class, with models trained on Zyda-2 outperforming those trained on contemporary open datasets. Through targeted weighting experiments, they show that increasing the share of FineWeb-Edu improves results, while maintaining diversity from smaller datasets remains beneficial. The work provides practical guidance on open-source data curation, underscores the nuanced role of duplicates, and points to future directions in data filtering and synthetic augmentation to push the frontier of small-to-mid-size language models.

Abstract

In this technical report, we present Zyda-2: a five-trillion-token dataset for language model pretraining. Zyda-2 was used to train our Zamba2 series of models, which are state-of-the-art for their weight class. We build Zyda-2 by collating tokens from high-quality open-source datasets such as FineWeb and DCLM, then distilling them to the highest-quality subset via cross-deduplication and model-based quality filtering. Zyda-2 is released under a permissive open license and is available at https://huggingface.co/datasets/Zyphra/Zyda-2
