The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin

TL;DR

The paper interrogates Classifier-Based Quality Filtering (CQF), a prevalent pretraining data-selection method, by formalizing HQ/LQ distributions and analyzing how CQF scoring relates to true data quality. It shows that CQF’s quality signal largely encodes a likelihood-ratio trade-off rather than closeness to the high-quality distribution, leading to implicit filtering of HQ data and data-conditioning behavior that is not universally beneficial. Through both real-data experiments and semi-synthetic mixtures, it demonstrates that CQF can improve downstream benchmarks without implementing a universal, data-conditioned notion of quality, and contrasts CQF with importance-sampling approaches like CRISP, which better optimize language modeling on HQ data but may not align with downstream goals. The work argues for rethinking CQF as a limited, task-dependent data-conditioning tool and highlights the need for more principled quality notions that accelerate learning across diverse downstream distributions.

Abstract

Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
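The CQF procedure described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are synthetic Gaussian vectors standing in for real document embeddings, the binary classifier is assumed to be a logistic regression fitted by plain gradient descent, and the names `train_logistic` and `cqf_select` are hypothetical.

```python
import numpy as np

def train_logistic(X, y, lr=0.5, steps=500):
    """Fit a logistic-regression classifier with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(HQ | x)
        grad = p - y                             # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def cqf_select(pretrain_emb, hq_emb, k):
    """Score pretraining documents and keep the top-k fraction.

    Returns the indices of the retained documents and the scores of all
    pretraining documents.
    """
    # Label HQ documents 1 and pretraining documents 0, then fit the classifier.
    X = np.vstack([hq_emb, pretrain_emb])
    y = np.concatenate([np.ones(len(hq_emb)), np.zeros(len(pretrain_emb))])
    w, b = train_logistic(X, y)
    # Use the logit as the quality score; it is monotone in P(HQ | x).
    scores = pretrain_emb @ w + b
    n_keep = max(1, int(round(k * len(pretrain_emb))))
    kept = np.argsort(-scores)[:n_keep]
    return kept, scores

# Synthetic stand-ins for document embeddings (an assumption for illustration).
rng = np.random.default_rng(0)
hq_emb = rng.normal(loc=1.0, size=(200, 8))
pretrain_emb = rng.normal(loc=0.0, size=(1000, 8))

kept, scores = cqf_select(pretrain_emb, hq_emb, k=0.1)
print(f"kept {len(kept)} of {len(pretrain_emb)} documents")
```

Note that the selection depends only on the ranking of the scores, so any monotone transform of the classifier output (probability, logit, log-score) yields the same filtered set for a given $k$.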

Paper Structure

This paper contains 19 sections, 5 equations, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Classifier-based Quality Filtering (CQF) pipeline. A document embedding model (e.g. sBert, Arctic-Embed or FastText) embeds documents from a high-quality dataset and the pretraining set. A binary classifier is trained on those embeddings to distinguish the HQ set from the pretraining set. Scores assigned by the classifier are used to rank documents from the pretraining set. The top $k$ fraction of those documents constitutes the new filtered CQF dataset.
  • Figure 2: Top row: Models trained on increasingly selective data show improved performance on downstream tasks. Bottom row: When evaluated on the HQ dataset itself, these models do not necessarily improve as there is a non-increasing relationship between downstream performance and loss on the HQ set.
  • Figure 3: Two-dimensional PCA projections of sBert embeddings from quality buckets defined by classifiers, each using a different HQ set. Quality buckets across classifiers (CQF) used in the literature exhibit alignment towards benchmark datasets. When considering the top 100%, we fall back to the original pretraining dataset (RedPajama-V2) regardless of the HQ set used.
  • Figure 4: CQF works by filtering out the low-quality data (red), not because the retained data (green) resemble the HQ set (orange). This is clear both from the raw log-scores of the classifier (left) and from the 2D PCA of the sBert latent space (right). t-SNE projections show similar patterns in \ref{app:biases}.
  • Figure 5: CQF implicitly filters the HQ set. We split the HQ set (KnowledgePile) into $10$ deciles of CQF scores. Left. For each model trained with CQF at a given fraction $k$, we report the loss of the model on each of these 10 deciles. The reddest curve corresponds to the loss on the HQ elements with the bottom $10\%$ scores, while the greenest curve corresponds to the top $10\%$. Our findings indicate that only the high-quality deciles of the HQ set exhibit a decreasing loss. This suggests that the classifier effectively identifies and learns the features within these deciles, enabling the models to make better predictions. However, on average over all the deciles (dotted line), the loss is a U-curve, recovering the loss in \ref{fig:accuracy_vs_loss} (second row and column). Right. In sBert latent space, we compute the distance between the barycenter of ARC-Easy and the barycenter of each HQ decile. This distance correlates well with performance on the ARC-Easy benchmark itself.
  • ...and 15 more figures
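
The decile analysis in the Figure 5 caption (splitting the HQ set by CQF score, then measuring the distance from each decile's barycenter to a benchmark's barycenter) can be sketched as follows. This is an illustrative sketch under assumptions, not the paper's code: the embeddings and scores are synthetic, Euclidean distance is assumed as the metric, and the helper names `split_into_deciles` and `barycenter_distances` are hypothetical.

```python
import numpy as np

def split_into_deciles(scores, n_bins=10):
    """Return index arrays for n_bins score buckets, lowest-scoring first."""
    order = np.argsort(scores)
    return np.array_split(order, n_bins)

def barycenter_distances(hq_emb, deciles, benchmark_emb):
    """Euclidean distance from the benchmark barycenter to each decile's barycenter."""
    target = benchmark_emb.mean(axis=0)
    return np.array(
        [np.linalg.norm(hq_emb[idx].mean(axis=0) - target) for idx in deciles]
    )

# Synthetic stand-ins (assumptions for illustration only):
# - hq_emb plays the role of embeddings of the HQ set,
# - scores plays the role of CQF scores, here a projection onto one direction,
# - benchmark_emb plays the role of embeddings of a benchmark such as ARC-Easy.
rng = np.random.default_rng(0)
hq_emb = rng.normal(size=(500, 4))
direction = np.ones(4) / 2.0          # unit-norm direction
scores = hq_emb @ direction
benchmark_emb = rng.normal(loc=2.0, size=(100, 4))

deciles = split_into_deciles(scores)
distances = barycenter_distances(hq_emb, deciles, benchmark_emb)
for i, d in enumerate(distances, start=1):
    print(f"decile {i:2d}: distance to benchmark barycenter = {d:.2f}")
```

In this toy setup the benchmark is placed along the scoring direction by construction, so higher-scoring deciles sit closer to it; with real data, whether that relationship holds is the empirical question the figure addresses.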