Large Language Model-guided Document Selection
Xiang Kong, Tom Gunter, Ruoming Pang
TL;DR
This work tackles the compute bottleneck of large language model pre-training by proposing LMDS, a scalable two-stage document selection pipeline: a strong LM labels a sample of documents, and a cheaper model distilled from those labels scores the full corpus. After roughly 75% of the data is filtered out, models trained on the remaining subset achieve comparable or superior performance on the CoreEN and MMLU benchmarks with at most 70% of the FLOPs. The study systematically analyzes the effects of the labeling prompt, labeling model capacity, distillation model size, and in-context learning, showing that more capable labelers and larger distillation models yield better results that are more robust to the choice of prompt. All experiments use open-source datasets and evaluation frameworks to enable reproducibility, and the results motivate future work on applying the selection to unfiltered corpora and on domain-specific prompting.
Abstract
Large Language Model (LLM) pre-training consumes an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection. Employing a prompted LLM as a document grader, we distill its quality labels into a classifier model, which is then applied autonomously and at scale to a large, already heavily filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: (1) filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs; (2) more capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt; and (3) in-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
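
The final filtering step described above (score every document with the distilled classifier, then drop 75% of the corpus) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names select_top_fraction and toy_quality_score are hypothetical, the toy scorer merely stands in for the distilled classifier, and ranking by score and keeping the top quarter is one plausible reading of "drop 75%", not necessarily the exact rule used in this work.

```python
from typing import Callable, List, Sequence


def select_top_fraction(
    docs: Sequence[str],
    score_fn: Callable[[str], float],
    keep_fraction: float = 0.25,
) -> List[str]:
    """Keep the highest-scoring `keep_fraction` of docs, preserving input order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    scores = [score_fn(d) for d in docs]
    # Always keep at least one document so tiny corpora are not emptied.
    n_keep = max(1, int(len(docs) * keep_fraction))
    # Rank indices by score, highest first; ties go to the earlier document.
    ranked = sorted(range(len(docs)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:n_keep])
    return [d for i, d in enumerate(docs) if i in keep]


def toy_quality_score(doc: str) -> float:
    """Hypothetical stand-in for the distilled classifier.

    Returns the fraction of whitespace-separated tokens that are purely
    alphabetic. A real pipeline would call a trained classifier instead.
    """
    words = doc.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)
```

With keep_fraction=0.25, a corpus of four documents is reduced to the single highest-scoring one. In practice score_fn would wrap the classifier distilled from the LLM grader's labels, and scoring would be batched rather than done one document at a time.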
