Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter

TL;DR

The paper investigates how data filtering interacts with training scale for large language models, focusing on the tension between data quality and data quantity. It systematically studies repetition of filtered datasets and, crucially, document-level repetition strategies. It shows that repeating high-quality filtered data for multiple epochs can outperform a single epoch on a larger, less filtered dataset, provided the training recipe is adjusted (in particular, the weight decay). A key contribution is demonstrating that explicitly manipulating per-document counts can further improve a dataset under a fixed token budget, which offers practical guidance for smaller models and specialized pre-training. The paper argues that data filtering remains a valuable and practical research direction as models scale.

Abstract

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and deduplicated datasets will become a practical constraint. In efforts to better understand how to proceed, we study model performance at various compute budgets and across multiple pre-training datasets created through data filtering and deduplication. We find that, given appropriate modifications to the training recipe, repeating existing aggressively filtered datasets for up to ten epochs can outperform training on the ten times larger superset for a single epoch across multiple compute budget orders of magnitude. While this finding relies on repeating the dataset for many epochs, we also investigate repeats within these datasets at the document level. We find that not all documents within a dataset are equal, and we can create better datasets relative to a token budget by explicitly manipulating the counts of individual documents. We conclude by arguing that even as large language models scale, data filtering remains an important direction of research.
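The relationship between token budget, unique tokens, and epochs in the abstract can be made concrete with a small arithmetic sketch. The function names are hypothetical, and the default of 20 tokens per parameter is an assumption (a common Chinchilla-style heuristic) that happens to match the 12.6B-parameter, 252B-token setting quoted in the Figure 1 caption; it is not taken from the paper's code.

```python
def token_budget(params: float, tokens_per_param: float = 20.0,
                 multiplier: float = 1.0) -> float:
    """Total training tokens for a model size.

    tokens_per_param=20 is an assumed heuristic, not a value from the paper's code.
    """
    return params * tokens_per_param * multiplier


def epochs_needed(total_tokens: float, unique_tokens: float) -> float:
    """Number of passes over a dataset needed to consume a token budget."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return total_tokens / unique_tokens


# Largest setting mentioned in the Figure 1 caption: 12.6B parameters.
budget = token_budget(12.6e9)

# A filtered dataset one tenth the size of its superset must be repeated
# ten times to fill the same budget that the superset fills in one epoch.
filtered_unique = budget / 10
superset_unique = budget

print(f"budget: {budget / 1e9:.0f}B tokens")
print(f"filtered epochs: {epochs_needed(budget, filtered_unique):.0f}")
print(f"superset epochs: {epochs_needed(budget, superset_unique):.0f}")
```

The comparison in the abstract holds the total token budget fixed, so the only thing that changes between the two runs is how many of those tokens are unique.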

Paper Structure

This paper contains 23 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Repeating the aggressively filtered dataset DCLM-baseline for up to ten epochs consistently outperforms training on a single epoch of the ten times larger RefinedWeb superset (cyan), provided that we adapt the weight decay for the high-repetition runs. On the left, we show that this result appears consistent across compute budgets (model sizes of 1B, 3B, 7B, and 12.6B). On the right, we show that, at the largest compute budget tested (12.6B parameters, 252B total tokens seen), adapting the weight decay as a function of repetition allows for a significantly better result versus training on the superset. Results are evaluated on the centered core metric from DCLM, which is a normalized average over 22 tasks. We also include an MMLU version of the right side in the appendix.
  • Figure 2: C4 perplexity as a function of training dataset and repetition count. Over-repeating data leads to diminishing returns irrespective of the training dataset chosen. For each dataset, we vary the number of unique tokens available, and then vary the training token budget. In the left plot, we train 1B parameter models to show that heavy repetition results in a similar degradation in validation loss for all datasets tested. The right-hand side shows that this effect holds at larger compute budgets for similarly over-trained models on the DCLM dataset.
  • Figure 3: Downstream accuracy averaged over 22 tasks as different datasets are repeated and overtrained. The left side contains only 1B models across a variety of datasets and unique token counts, while the right side trains only on DCLM-baseline while varying model size and unique tokens.
  • Figure 4: Downstream accuracy averaged over 22 tasks on the DCLM-baseline dataset as we fix the total tokens trained and vary the epochs and unique tokens seen. The dotted horizontal line represents training for one epoch on the RefinedWeb baseline, which we compare against because DCLM-baseline applies additional filtering on top of RefinedWeb. The legend denotes the model size and the multiplier used to scale tokens seen relative to Chinchilla optimal.
  • Figure 5: Examples of strategies for count manipulation. Note that Greedy 1 copy is similar to global deduplication, and can result in worse performance if document quality varies significantly. For example, if the dataset is split evenly between high-quality documents and low-quality documents, it may be better to repeat the high-quality documents.
  • ...and 5 more figures
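
The count-manipulation idea in the Figure 5 caption can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Doc` class, the `quality` score, and the `build_dataset` function are all hypothetical, and the strategy shown (fill the token budget with the best-scoring documents first, allowing at most `max_copies` of each) is only one plausible reading of a "greedy" strategy with a copy cap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tokens: int
    quality: float  # hypothetical quality score; higher is better


def build_dataset(docs, token_budget, max_copies):
    """Return {doc_id: copies}, filling the budget with the best documents first.

    Each document is included at most max_copies times. This is an
    illustrative assumption, not the method used in the paper.
    """
    counts = {}
    remaining = token_budget
    for doc in sorted(docs, key=lambda d: d.quality, reverse=True):
        if doc.tokens <= 0:
            continue  # skip empty documents to avoid division by zero
        copies = min(max_copies, remaining // doc.tokens)
        if copies > 0:
            counts[doc.doc_id] = copies
            remaining -= copies * doc.tokens
    return counts


# Toy corpus split evenly between high-quality and low-quality documents.
docs = [
    Doc("high_a", 100, 0.9),
    Doc("high_b", 100, 0.8),
    Doc("low_a", 100, 0.2),
    Doc("low_b", 100, 0.1),
]

one_copy = build_dataset(docs, token_budget=400, max_copies=1)
two_copies = build_dataset(docs, token_budget=400, max_copies=2)
print(one_copy)    # every document appears once
print(two_copies)  # only the high-quality documents appear, twice each
```

With a cap of one copy, the 400-token budget is spent evenly across all four documents, which mirrors global deduplication. With a cap of two, the same budget goes entirely to the two high-quality documents, which is the situation the caption describes as potentially preferable when quality varies significantly.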