Large Language Model-guided Document Selection

Xiang Kong; Tom Gunter; Ruoming Pang

TL;DR

This work tackles the compute cost of large language model pre-training by proposing LMDS, a two-stage, scalable document selection pipeline: a strong LM labels a sample of documents, and a cheaper, distilled LM applies those labels to the full corpus. After roughly 75% of the data is filtered out, models trained on the remaining subset achieve comparable or superior performance on the CoreEN and MMLU benchmarks with at most 70% of the FLOPs. The study systematically analyzes the effects of labeling prompts, labeling model capacity, distillation model size, and in-context learning, showing that more capable labelers and larger distillation models yield better and more robust results. All experiments use open-source datasets and evaluation frameworks to enable reproducibility, and the results motivate extending data selection to unfiltered corpora at scale and to domain-specific prompting in future work.

Abstract

Large Language Model (LLM) pre-training consumes an ever-growing compute budget, yet recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs. Inspired by efforts suggesting that domain-specific training document selection is in fact an interpretable process [Gunasekar et al., 2023], as well as research showing that instruction-finetuned LLMs are adept zero-shot data labelers [Gilardi et al., 2023], we explore a promising direction for scalable general-domain document selection. Employing a prompted LLM as a document grader, we distill quality labels into a classifier model, which is then applied autonomously and at scale to a large, already heavily filtered, web-crawl-derived corpus. Following the guidance of this classifier, we drop 75% of the corpus and train LLMs on the remaining data. Results across multiple benchmarks show that: (1) filtering allows us to quality-match a model trained on the full corpus across diverse benchmarks with at most 70% of the FLOPs; (2) more capable LLM labelers and classifier models lead to better results that are less sensitive to the labeler's prompt; (3) in-context learning helps to boost the performance of less-capable labeling models. In all cases we use open-source datasets, models, recipes, and evaluation frameworks, so that results can be reproduced by the community.
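
The two-stage flow described in the abstract (label a sample with a strong model, distill into a cheap classifier, score the whole corpus, keep a fraction) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `label_with_large_lm` is a hypothetical stand-in for the prompted LLM grader (here a trivial heuristic), `TinyClassifier` is a hypothetical word-frequency scorer standing in for the distilled LM classifier, the binary label scheme is assumed, and keeping the top-scoring 25% is one plausible reading of "drop 75% of the corpus".

```python
import random
from collections import Counter


def label_with_large_lm(doc: str) -> int:
    """Hypothetical stand-in for the prompted LLM labeler.

    Returns 1 (high quality) or 0 (low quality). A real pipeline would
    send the document and a grading prompt to a strong LLM instead.
    """
    return int(len(set(doc.lower().split())) >= 5)


class TinyClassifier:
    """Hypothetical stand-in for the distilled quality classifier."""

    def fit(self, docs, labels):
        # Count, per word, how many positive / negative documents contain it.
        self.pos, self.neg = Counter(), Counter()
        for doc, label in zip(docs, labels):
            target = self.pos if label else self.neg
            target.update(set(doc.lower().split()))
        return self

    def score(self, doc: str) -> float:
        # Mean smoothed probability that a word comes from a positive document.
        words = set(doc.lower().split())
        if not words:
            return 0.0
        probs = [
            (self.pos[w] + 1) / (self.pos[w] + self.neg[w] + 2) for w in words
        ]
        return sum(probs) / len(probs)


def select_documents(corpus, n_labeled, keep_fraction, seed=0):
    """Label a sample, distill into a classifier, keep the top-scoring docs."""
    rng = random.Random(seed)
    sample = rng.sample(corpus, n_labeled)                 # stage 1: sample n docs
    labels = [label_with_large_lm(d) for d in sample]      # stage 1: expensive labels
    clf = TinyClassifier().fit(sample, labels)             # stage 2: distillation
    ranked = sorted(corpus, key=clf.score, reverse=True)   # stage 2: score everything
    n_keep = max(1, int(len(corpus) * keep_fraction))
    return ranked[:n_keep]
```

With `keep_fraction=0.25`, the function returns a quarter of the corpus, mirroring the 75% drop rate reported above. The point of the split is cost: the expensive labeler sees only `n_labeled` documents, while the cheap classifier is the only component that touches every document.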

Paper Structure (26 sections, 4 figures, 9 tables)

Introduction
Method
Proposed Framework
$\mathbf{LM}_{\text{large}}$-Labelling
$\mathbf{LM}_{\text{small}}$-Refinement
Rationale for Using $\mathbf{LM}_{\text{small}}$ as well as $\mathbf{LM}_{\text{large}}$
Experiments
Experiment Setup
Source dataset:
Language Model-guided Document Selection Pipeline:
Language model training on selected documents:
Language model evaluation:
Experiment Results
Model quality with a Fixed Training Budget:
Data efficiency:
...and 11 more sections

Figures (4)

Figure 1: The prompt used to guide the LLM labeler to assess the quality and educational value of an input document.
Figure 2: Overview of our proposed LM-guided data selection pipeline. Given a raw corpus, we first sample $n$ documents and guide an LLM labeler to assess them in terms of textual quality and educational value. The resulting (doc, label) pairs are then distilled into an LM-based quality classifier, which labels all documents in the raw corpus.
Figure 3: Learning curves on the downstream task for 7B models pretrained on the raw data versus LMDS-based filtered data.
Figure 4: Downstream task accuracy of 1B models trained on LMDS-based filtered datasets with different selection ratios.
