Large Language Model-guided Document Selection

Xiang Kong, Tom Gunter, Ruoming Pang

TL;DR

This work tackles the compute bottleneck of large language model pre-training by proposing LMDS, a two-stage, scalable document selection pipeline: a strong LM labels a sample of documents, and a cheaper LM distilled from those labels scores the full corpus. After roughly 75% of the data is filtered out, models trained on the remaining subset match or exceed the full-corpus baseline on the CoreEN and MMLU benchmarks with at most 70% of the FLOPs. The study systematically analyzes the effects of the labeling prompt, labeling model capacity, distillation model size, and in-context learning, and finds that more capable labelers and larger distillation models give better results that are more robust to the prompt. All experiments use open-source datasets and evaluation frameworks to enable reproducibility, and the results motivate future work on applying data selection to unfiltered corpora and on domain-specific prompting.

Abstract

Large Language Model (LLM) pre-training exhausts an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection. Employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is applied autonomously and at scale to a large, already heavily filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: (1) filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs; (2) more capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt; (3) in-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
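The selection step described in the abstract (keep the best-scored quarter of the corpus, drop the other 75%) can be illustrated with a minimal sketch. It assumes that each document already has a scalar quality score from the classifier and that higher means better. The function name `select_top_fraction`, the rank-based cut, and the example scores are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Sequence


def select_top_fraction(scores: Sequence[float], keep_fraction: float = 0.25) -> list[int]:
    """Return indices of the highest-scoring documents.

    Assumption: `scores[i]` is a classifier quality score for document i,
    and higher is better. `keep_fraction=0.25` corresponds to dropping 75%.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = math.ceil(len(scores) * keep_fraction)
    # Rank document indices from best to worst score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in original corpus order.
    return sorted(ranked[:n_keep])


# Hypothetical scores for an eight-document corpus.
example_scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.3, 0.7, 0.05]
kept = select_top_fraction(example_scores, keep_fraction=0.25)
```

On a real corpus a score threshold estimated from a sample would likely be cheaper than a full sort; the rank-based version is used here only because it makes the kept fraction exact.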

Paper Structure (26 sections, 4 figures, 9 tables)

This paper contains 26 sections, 4 figures, 9 tables.

Figures (4)

  • Figure 1: The prompt used to guide the LLM labeler in assessing the quality and educational value of an input document.
  • Figure 2: Overview of our proposed LM-guided data selection pipeline. Given a raw corpus, we first sample $n$ documents and guide an LLM labeler to assess their textual quality and educational value. The resulting (doc, label) pairs are then distilled into an LM-based quality classifier, which labels all documents in the raw corpus.
  • Figure 3: Learning curves on the downstream task for 7B model pretraining on the raw data versus LMDS-based filtered data.
  • Figure 4: Downstream task accuracy of 1B models trained on LMDS-based filtered datasets with different selection ratios.
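The data flow in the Figure 2 caption (sample $n$ documents, label them, distill the labels into a classifier, score the whole corpus) can be sketched as follows. Everything here is a hypothetical stand-in: `label_fn` takes the place of the prompted LLM labeler, `train_fn` takes the place of distillation into an LM-based classifier, and the toy keyword labeler and word-overlap scorer exist only to make the sketch runnable. None of it reflects the models or prompts used in the paper.

```python
import random
from typing import Callable, Sequence

# Hypothetical interfaces: a labeler maps a document to a quality label,
# and a trainer turns (doc, label) pairs into a scoring function.
Labeler = Callable[[str], int]
Scorer = Callable[[str], float]
Trainer = Callable[[list[tuple[str, int]]], Scorer]


def lmds_scores(corpus: Sequence[str], n: int, label_fn: Labeler,
                train_fn: Trainer, seed: int = 0) -> list[float]:
    """Label a sample of n documents, fit a classifier, score every document."""
    rng = random.Random(seed)
    sample = rng.sample(list(corpus), min(n, len(corpus)))
    pairs = [(doc, label_fn(doc)) for doc in sample]   # labeling stage
    classifier = train_fn(pairs)                       # distillation stage
    return [classifier(doc) for doc in corpus]         # corpus-wide scoring


def toy_label(doc: str) -> int:
    """Toy stand-in for the LLM labeler: 1 if the text mentions a theorem."""
    return 1 if "theorem" in doc else 0


def toy_train(pairs: list[tuple[str, int]]) -> Scorer:
    """Toy stand-in for distillation: score by overlap with positive-doc words."""
    good_words = {w for doc, label in pairs if label == 1 for w in doc.split()}

    def score(doc: str) -> float:
        words = doc.split()
        return sum(w in good_words for w in words) / max(len(words), 1)

    return score
```

A usage example with a three-document toy corpus: `lmds_scores(["the theorem holds", "buy now cheap", "a theorem proof"], n=3, label_fn=toy_label, train_fn=toy_train)` returns one score per document, which could then be thresholded to decide what to keep.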