DataRater: Meta-Learned Dataset Curation
Dan A. Calian, Gregory Farquhar, Iurii Kemaev, Luisa M. Zintgraf, Matteo Hessel, Jeremy Shar, Junhyuk Oh, András György, Tom Schaul, Jeffrey Dean, Hado van Hasselt, David Silver
TL;DR
DataRater addresses data quality in large-scale pre-training by learning to value individual data points: it uses meta-gradients to optimize training efficiency on held-out data. Specifically, a per-example scorer $\phi_\eta$ assigns weights to data, enabling batch-level filtering or re-weighting during pre-training. Across datasets such as the Pile and C4/noclean, and model scales from tens of millions to a billion parameters, DataRater achieves substantial compute savings (up to 46.6%) and often improves final downstream performance, with meta-training costs amortized across many runs. These results demonstrate the feasibility of meta-learning the data curation pipeline for modern LLMs and suggest opportunities for online adaptation and robustness to distribution shifts.
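To make the batch-level filtering idea concrete, here is a minimal sketch in plain Python. It assumes that filtering means keeping the top-scoring fraction of each batch; the names `filter_batch`, `toy_scorer`, and `keep_fraction` are hypothetical and do not come from the paper. The toy scorer is a hand-written stand-in for the learned scorer $\phi_\eta$, used only so the example runs on its own.

```python
import math
from typing import Callable, List, Sequence


def filter_batch(
    examples: Sequence[str],
    scorer: Callable[[str], float],
    keep_fraction: float = 0.5,
) -> List[str]:
    """Keep the highest-scoring fraction of a batch, preserving batch order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not examples:
        return []
    scores = [scorer(x) for x in examples]
    n_keep = max(1, math.ceil(keep_fraction * len(examples)))
    # Indices of the n_keep best-scored examples.
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [examples[i] for i in kept]


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for a learned scorer (assumption, not the paper's
    model): the fraction of characters that are letters or whitespace."""
    if not text:
        return 0.0
    return sum(c.isalpha() or c.isspace() for c in text) / len(text)


batch = [
    "The cat sat on the mat.",
    "@@@@ #### 1234",
    "Meta-learning selects data.",
    "0x00 0x00 0x00",
]
kept = filter_batch(batch, toy_scorer, keep_fraction=0.5)
```

In a real pipeline the scorer would be the meta-learned model, and the kept examples would form the batch actually used for the gradient step. Re-weighting would instead keep every example and scale its loss by the score.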
Abstract
The quality of foundation models depends heavily on their training data. Consequently, great effort has gone into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or filtering by hand-crafted heuristics. An approach that is ultimately more scalable (let alone more satisfying) is to \emph{learn} which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed \emph{DataRater} is an instance of this idea. It estimates the value of training on any particular data point. This is done by meta-learning using `meta-gradients', with the objective of improving training efficiency on held-out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
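The meta-gradient idea can be illustrated with a toy example. The sketch below is not the paper's setup. It assumes a scalar linear model, per-example weights given by a sigmoid of learnable scores `eta`, and a single unrolled inner step, all chosen for illustration. It differentiates the held-out loss after one weighted training step with respect to the scores, then uses that meta-gradient to update them.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def inner_step(theta: float, eta: Sequence[float],
               train: Sequence[Example], alpha: float) -> float:
    """One weighted gradient step for the model y = theta * x."""
    grad = sum(sigmoid(e) * (theta * x - y) * x for e, (x, y) in zip(eta, train))
    return theta - alpha * grad


def outer_loss(theta: float, val: Sequence[Example]) -> float:
    """Mean of 0.5 * squared error on held-out data."""
    return sum(0.5 * (theta * x - y) ** 2 for x, y in val) / len(val)


def meta_gradient(theta: float, eta: Sequence[float], train: Sequence[Example],
                  val: Sequence[Example], alpha: float) -> List[float]:
    """Gradient of outer_loss(inner_step(theta, eta)) with respect to eta."""
    theta_new = inner_step(theta, eta, train, alpha)
    dloss_dtheta = sum((theta_new * x - y) * x for x, y in val) / len(val)
    grads = []
    for e, (x, y) in zip(eta, train):
        w = sigmoid(e)
        # Chain rule: d theta_new / d eta_i = -alpha * grad_i * w * (1 - w).
        dtheta_deta = -alpha * (theta * x - y) * x * w * (1.0 - w)
        grads.append(dloss_dtheta * dtheta_deta)
    return grads


train = [(1.0, 2.0), (2.0, 4.0), (1.0, -3.0)]  # last label is corrupted
val = [(1.0, 2.0), (3.0, 6.0)]                 # clean held-out data, y = 2x
eta = [0.0, 0.0, 0.0]
for _ in range(50):
    # Each outer step restarts the inner model at theta = 0 (a simplification).
    g = meta_gradient(0.0, eta, train, val, alpha=0.1)
    eta = [e - 1.0 * gi for e, gi in zip(eta, g)]
weights = [sigmoid(e) for e in eta]
```

In this toy setting the weight of the corrupted example falls below its starting value of 0.5 while the weights of the clean examples rise, because down-weighting the corrupted label lowers the held-out loss after the inner step. At scale, one would typically rely on automatic differentiation rather than a hand-derived chain rule.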
