GPT-4o as the Gold Standard: A Scalable and General Purpose Approach to Filter Language Model Pretraining Data

Jifan Zhang, Ziyue Luo, Jia Liu, Ness Shroff, Robert Nowak

TL;DR

This work tackles the high cost of filtering web-scale pretraining data for large language models by introducing SIEVE, a streaming active-distillation system that imitates GPT-4o filtering with a lightweight encoder to achieve GPT-4o-level quality at under 1% of the cost. It couples a novel TRM (True Risk Minimizer) threshold-based streaming active learning algorithm with background distillation to train an efficient binary classifier, sharply reducing GPT-4o queries while maintaining high filtering fidelity. Theoretical analysis provides balancedness and risk bounds for the TRM-based method. Extensive experiments on OpenWebText show SIEVE matching GPT-4o across multiple domain prompts, along with strong improvements on the DataComp-LM benchmark. The results demonstrate a scalable, domain-adaptable approach to curating high-quality pretraining data, enabling broader access to such datasets for language model development.

Abstract

Large language models require vast amounts of high-quality training data, but effective filtering of web-scale datasets remains a significant challenge. This paper demonstrates that GPT-4o is remarkably effective at identifying high-quality training data, but its prohibitive cost makes it impractical at web scale. We propose SIEVE, a lightweight alternative that matches GPT-4o accuracy at less than 1% of the cost. SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight text classification models, using active learning to fine-tune these models in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. Through different filtering prompts, SIEVE can efficiently curate high-quality data for general or specialized domains from web-scale corpora, a valuable capability given the current scarcity of high-quality domain-specific datasets. Extensive experiments using automatic and human evaluation metrics show that SIEVE and GPT-4o achieve similar performance on five highly specific filtering prompts. In addition, when performing quality filtering on web crawl datasets, we demonstrate that SIEVE can further improve over state-of-the-art quality filtering methods in the DataComp-LM challenge for selecting LLM pretraining data.
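
The cost claim can be made concrete with a back-of-the-envelope calculation. The sketch below is not from the paper. It assumes cost is measured in units of one GPT-4o filtering call, reads the 500x figure as the per-snippet cost ratio of the lightweight model, and uses a hypothetical corpus size and query budget; the function name `sieve_cost_fraction` is likewise illustrative.

```python
def sieve_cost_fraction(n_snippets: int,
                        n_teacher_queries: int,
                        ops_per_teacher_call: float = 500.0) -> float:
    """Cost of a distilled filter relative to calling GPT-4o on every snippet.

    Cost unit: one GPT-4o filtering call = 1.
    Assumption: the lightweight model filters `ops_per_teacher_call`
    snippets for the price of one GPT-4o call (the 500x figure above).
    """
    distillation_cost = float(n_teacher_queries)          # labels bought from GPT-4o
    inference_cost = n_snippets / ops_per_teacher_call    # lightweight model on all snippets
    baseline_cost = float(n_snippets)                     # GPT-4o on every snippet
    return (distillation_cost + inference_cost) / baseline_cost


# Hypothetical numbers: 10M snippets, 60k GPT-4o queries for distillation.
fraction = sieve_cost_fraction(n_snippets=10_000_000, n_teacher_queries=60_000)
print(f"{fraction:.2%}")  # prints 0.80%
```

Under these assumed numbers the total stays below the 1% mark. The fraction shrinks toward 1/500 = 0.2% as the corpus grows relative to the query budget.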

Paper Structure (26 sections, 2 theorems, 13 equations, 3 figures, 5 tables, 1 algorithm)

Key Result

Theorem 4.1

During iteration $r$ of Algorithm 1, given the classifier model $f$, with probability at least $1 - \delta$, both $R(\underline{\mu}) - R(s^\star)$ and $R(\bar{\mu}) - R(s^\star)$ are upper bounded by for all the confidence intervals $[\underline{\mu}, \bar{\mu}]$ updated at time $t \in 2^{N^+}$. Here, $c_0, c_1$ and $c_2$ are some universal constants.

Figures (3)

  • Figure 1: System Overview. From the user's perspective, SIEVE acts as if it were applying GPT-4o with the filtering prompt to every text snippet in a web-scale dataset. The output of the SIEVE system is the set of all text snippets that receive a 'pass'. To reduce the prohibitively high cost of applying GPT-4o to every snippet, SIEVE uses active learning to distill lightweight filtering models based on pretrained encoders (e.g., T5 or DeBERTa), effectively reducing the overall cost to less than $1\%$.
  • Figure 2: Demonstration of the TRM threshold. Snippets (shown at the bottom) are first ordered by their predictive sigmoid scores. GPT-4o class labels $0$ and $1$ are represented by solid and dashed borders. Queried snippets are shaded. In imbalanced scenarios, a sigmoid score of $0.5$ generally does not give a good indication of where to sample, and will likely result in labeling many more snippets from the majority class. The probability $\mu(s)$ denotes the likelihood that a snippet with sigmoid score $s$ belongs to class 0. The TRM threshold is defined to best separate the two classes of snippets.
  • Figure 3: Active vs Random Distillation: Performance of the distilled lightweight model across different numbers of queries made to GPT-4o. With the active learning algorithm proposed in SIEVE, the number of queries to GPT-4o is reduced by more than 5x for the politics filter and more than 3x for the climate filter. We also observe significant query savings over the classic uncertainty sampling algorithm.
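
The idea behind Figure 2 can be illustrated with a small batch example. This is a simplified sketch, not the paper's streaming TRM algorithm: it assumes higher sigmoid scores indicate class 1, picks the threshold that minimizes empirical 0/1 error on already-labeled snippets, and then queries the unlabeled snippets closest to that threshold. The function names and all data are hypothetical.

```python
from typing import List, Sequence


def empirical_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold minimizing 0/1 errors for the rule: predict 1 if score > t."""
    def errors(t: float) -> int:
        return sum((s > t) != (y == 1) for s, y in zip(scores, labels))

    # Candidate cut points: below every score, or at each observed score.
    candidates = [0.0] + sorted(set(scores))
    return min(candidates, key=errors)  # quadratic time; fine for a sketch


def select_queries(unlabeled_scores: Sequence[float], threshold: float, k: int) -> List[int]:
    """Indices of the k unlabeled snippets whose scores are closest to the threshold."""
    order = sorted(range(len(unlabeled_scores)),
                   key=lambda i: abs(unlabeled_scores[i] - threshold))
    return order[:k]


# Imbalanced toy data: six class-0 snippets, two class-1 snippets, all scored below 0.5.
scores = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
t = empirical_threshold(scores, labels)

unlabeled = [0.01, 0.14, 0.55, 0.17, 0.90]
near_threshold = select_queries(unlabeled, t, k=2)   # sampling around the fitted threshold
near_half = select_queries(unlabeled, 0.5, k=2)      # sampling around the fixed 0.5 cut
print(t, near_threshold, near_half)
```

On this toy data the fitted threshold sits at 0.15, far from 0.5, so the two rules select different snippets to send to GPT-4o. The paper's method additionally maintains a confidence interval around the threshold and operates on a stream, which this sketch does not attempt.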

Theorems & Definitions (5)

  • Theorem 4.1: jamieson2022interactive
  • Definition 4.2: Score Re-Ordering
  • Definition 4.3: Discrete Smoothness
  • Theorem 4.4: Balancedness of Labeled Snippets
  • proof