DataRater: Meta-Learned Dataset Curation

Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver

TL;DR

DataRater tackles the challenge of data quality for large-scale pre-training by learning to value individual data points through meta-gradients, optimizing training efficiency on held-out data. Specifically, a per-example scorer $\phi_\eta$ assigns weights to data, enabling batch-level filtering or re-weighting during pre-training. Across datasets such as the Pile and C4/noclean and model scales from tens of millions to a billion parameters, DataRater achieves substantial compute savings (up to 46.6%) and often improves final downstream performance, with meta-training costs amortised across many runs. These results demonstrate the feasibility of meta-learning the data curation pipeline for modern LLMs and suggest opportunities for online adaptation and robustness to distribution shifts.
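As a rough illustration of the batch-level filtering mentioned above, the sketch below keeps the highest-scoring examples from an oversampled candidate batch. The function name `filter_batch`, the `keep` parameter, and the hard-coded scores are hypothetical; in practice the scores would come from the learned scorer $\phi_\eta$, and the selection rule actually used in the paper may differ.

```python
def filter_batch(examples, scores, keep):
    """Keep the `keep` highest-scoring examples, preserving their original order.

    `scores` stands in for the outputs of a learned per-example scorer
    (an assumption for this sketch; here they are plain numbers).
    """
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    if not 0 < keep <= len(examples):
        raise ValueError("keep must be between 1 and the batch size")
    # Rank candidate indices from highest to lowest score.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    # Restore the original ordering among the survivors.
    kept = sorted(ranked[:keep])
    return [examples[i] for i in kept]


# Oversample 4 candidates, then keep the best 2 (i.e. discard 50%).
candidates = ["doc_a", "doc_b", "doc_c", "doc_d"]
candidate_scores = [0.1, 2.0, -1.0, 0.7]
train_batch = filter_batch(candidates, candidate_scores, keep=2)
print(train_batch)
```

To discard a fraction $\rho$ of the data while still training on batches of size $B$, one would draw roughly $B / (1 - \rho)$ candidates per step and call `filter_batch` with `keep=B`; scoring these extra candidates is a natural source of filtering overhead.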

Abstract

The quality of foundation models depends heavily on their training data. Consequently, considerable effort has gone into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or filtering by hand-crafted heuristics. An approach that is ultimately more scalable (let alone more satisfying) is to learn which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed DataRater is an instance of this idea. It estimates the value of training on any particular data point. This is done by meta-learning using 'meta-gradients', with the objective of improving training efficiency on held-out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.

Paper Structure

This paper contains 28 sections, 5 equations, 15 figures, 2 tables, 1 algorithm.

Figures (15)

  • Figure 1: Compute needed to achieve baseline performance using the DataRater. A baseline $1$B model is trained on the unfiltered dataset, while a second $1$B model is trained analogously on the same dataset filtered by a DataRater. The x-axis lists the underlying datasets, with $3$ evaluation metrics each; the y-axis shows, in grey, the fraction of compute needed to match the baseline. The overhead of using the DataRater for online filtering is shown in pink. To calculate this metric we convert both the LLM training step cost and the DataRater inference step cost into FLOPs (while also accounting for batch-size oversampling). 'Validation set' refers to the respective validation set of the underlying dataset. Net compute gain is shown in green. The figure shows that filtering data using the DataRater results in significant overall compute gains for lower-quality datasets such as the Pile and C4/noclean.
  • Figure 2: DataRater meta-learning schematic. This diagram shows how the DataRater is updated using meta-gradients to minimise an outer loss, by back-propagating through multiple inner model updates (we experimented with up to $8$ inner updates) over weighted inner data batches. For more details on the implementation see Algorithm 1.
  • Figure 3: Motivating toy example: DataRater learns to weight examples in proportion to their corruption level. We consider two datasets, $\mathcal{D}$ and $\hat{\mathcal{D}}$, drawn from the same distribution, where sequences in $\hat{\mathcal{D}}$ have been corrupted with token-level noise at levels ranging from 0 (clean data) to 1 (fully random tokens). The DataRater weights correlate increasingly well with the noise fraction, showing that the DataRater recognises that corrupt sequences are worse for training than clean ones.
  • Figure 4: Loss on validation subsets, over multiple model scales, as a function of the percentage of data filtered out according to DataRater models (one DataRater model is trained for each dataset). For each underlying dataset (the Pile, C4/noclean and C4), the data filtering proportion resulting in the best validation set performance remains constant across evaluated model scales (highlighted in green).
  • Figure 5: Learning curves of 1B models trained on three separate underlying datasets, with and without DataRater filtering. On both C4/noclean and the Pile, models trained on DataRater-filtered data match the baseline's final performance in considerably fewer training steps and also reach better final performance. Using a DataRater to filter C4, a considerably higher-quality dataset, yields minor performance improvements on average (see the extended results table in the appendix for full results) while exhibiting trade-offs in downstream evaluations, as exemplified here.
  • ...and 10 more figures
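
The meta-gradient update described in the Figure 2 caption can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the "model" is a single scalar slope `theta`, the DataRater is replaced by one free score per example (no scorer network), scores are turned into weights with a softmax over the batch, only one inner update is used instead of several, and the meta-gradient is derived by hand with the chain rule instead of automatic differentiation.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to 1 over the batch."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def example_grads(theta, batch):
    """Per-example gradient of the squared error (theta * x - y)^2 w.r.t. theta."""
    return [2.0 * x * (theta * x - y) for x, y in batch]


def inner_update(theta, scores, batch, lr):
    """One SGD step on the score-weighted inner loss."""
    weights = softmax(scores)
    grads = example_grads(theta, batch)
    return theta - lr * sum(w * g for w, g in zip(weights, grads))


def outer_loss(theta, held_out):
    """Mean squared error on held-out (clean) data."""
    return sum((theta * x - y) ** 2 for x, y in held_out) / len(held_out)


def meta_gradient(theta, scores, batch, held_out, lr):
    """Gradient of the outer loss w.r.t. each score, through one inner step."""
    weights = softmax(scores)
    grads = example_grads(theta, batch)
    mean_grad = sum(w * g for w, g in zip(weights, grads))
    theta_new = theta - lr * mean_grad
    # d(outer loss) / d(theta_new)
    d_outer = sum(example_grads(theta_new, held_out)) / len(held_out)
    # d(theta_new) / d(score_j) = -lr * w_j * (g_j - mean_grad), from the softmax Jacobian.
    return [d_outer * (-lr) * w * (g - mean_grad) for w, g in zip(weights, grads)]


# Toy data with true slope 2; the last inner example has a corrupted label.
inner_batch = [(1.0, 2.0), (2.0, 4.0), (1.0, -3.0)]
held_out = [(1.0, 2.0), (3.0, 6.0)]
theta0, inner_lr, meta_lr = 0.0, 0.05, 0.1

scores = [0.0, 0.0, 0.0]
for _ in range(100):
    grad = meta_gradient(theta0, scores, inner_batch, held_out, inner_lr)
    scores = [s - meta_lr * g for s, g in zip(scores, grad)]

weights = softmax(scores)
print("learned weights:", [round(w, 3) for w in weights])
```

In this toy setting the corrupted example should end up with the smallest weight, which loosely mirrors the qualitative behaviour described in the Figure 3 caption. A realistic version would back-propagate through several inner updates of a neural network using an automatic differentiation library.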