Zyda: A 1.3T Dataset for Open Language Modeling
Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony
TL;DR
Zyda addresses the need for large-scale, open, high-quality pretraining data by unifying major permissively licensed datasets and applying rigorous two-stage filtering plus cross-dataset deduplication. The resulting 1.3T-token corpus yields better language-modeling performance than open baselines, with the gains largely attributable to the data-quality improvements from the processing pipeline and deduplication. In equal-token comparisons, Zyda's advantage grows with model scale, and ablations indicate that removing code-heavy components (e.g., StarCoder) can sharpen language-focused performance for smaller models. The work presents a practical, open resource for large-scale pretraining and suggests that combining Zyda with additional datasets such as FineWeb is one avenue for approaching frontier capabilities.
Abstract
The size of large language models (LLMs) has scaled dramatically in recent years, and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively small sizes, typically require training on at least a trillion tokens. This rapid advancement has outpaced the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major, well-regarded open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness: Zyda outperforms even the best of its constituent datasets when used independently.
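To make the idea of filtering followed by deduplication "both within and across datasets" concrete, the following is a minimal illustrative sketch, not the pipeline used to build Zyda. The function names, the filter thresholds, and the use of exact hashing over normalized text are all assumptions introduced here for illustration; the text above does not specify the filter rules or the duplicate-matching method.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def passes_filter(text: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical quality heuristic: enough words, not dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def build_corpus(datasets: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Filter each document, then drop exact duplicates within and across datasets."""
    seen: set[str] = set()  # shared across all datasets, so duplicates are caught globally
    kept: list[tuple[str, str]] = []
    for name, docs in datasets.items():
        for doc in docs:
            if not passes_filter(doc):
                continue
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already kept from this or an earlier dataset
            seen.add(digest)
            kept.append((name, doc))
    return kept


if __name__ == "__main__":
    toy = {
        "dataset_a": [
            "The quick brown fox jumps over the lazy dog.",
            "The quick  brown fox jumps over the lazy dog.",  # within-dataset duplicate
            "too short",
        ],
        "dataset_b": [
            "the quick brown fox jumps over the lazy dog.",  # cross-dataset duplicate
            "A second document that is unique and long enough.",
        ],
    }
    for source, text in build_corpus(toy):
        print(source, "|", text)
```

Because a single `seen` set is shared across every dataset, the order in which datasets are processed decides which copy of a duplicated document survives. A pipeline at this scale would more plausibly rely on approximate (near-duplicate) matching than on exact hashes; exact hashing is used here only to keep the sketch short.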
