The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining
Thiziri Nait Saada, Louis Bethune, Michal Klein, David Grangier, Marco Cuturi, Pierre Ablin
TL;DR
The paper interrogates Classifier-Based Quality Filtering (CQF), a prevalent pretraining data-selection method, by formalizing the high-quality (HQ) and low-quality (LQ) data distributions and analyzing how the CQF score relates to true data quality. It shows that the CQF score largely encodes a likelihood-ratio trade-off rather than closeness to the HQ distribution, so CQF implicitly filters the HQ data as well and conditions the training data in a way that is not universally beneficial. Through experiments on real data and on semi-synthetic mixtures, it demonstrates that CQF can improve downstream benchmarks without implementing a universal, data-conditioned notion of quality, and contrasts CQF with importance-sampling approaches such as CRISP, which better optimize language modeling on HQ data but may not align with downstream goals. The work argues for treating CQF as a limited, task-dependent data-conditioning tool and highlights the need for more principled notions of quality that accelerate learning across diverse downstream distributions.
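The likelihood-ratio reading of a classifier score follows from standard Bayes-classifier reasoning, sketched below. The symbols are notation assumed here rather than taken from the paper: p_HQ is the density of the high-quality set, p_PT is the density of the pretraining data, and π is the fraction of HQ-labeled examples seen by the classifier.

```latex
s^\star(x)
  = \frac{\pi\, p_{\mathrm{HQ}}(x)}{\pi\, p_{\mathrm{HQ}}(x) + (1-\pi)\, p_{\mathrm{PT}}(x)}
  = \sigma\!\left( \log\frac{p_{\mathrm{HQ}}(x)}{p_{\mathrm{PT}}(x)} + \log\frac{\pi}{1-\pi} \right),
\qquad
\sigma(t) = \frac{1}{1 + e^{-t}}.
```

Because σ is increasing, ranking documents by the ideal score s* is the same as ranking them by the ratio p_HQ(x)/p_PT(x). Under this idealization, a document can score highly simply because it is rare in the pretraining data, even if it is not especially likely under the HQ distribution.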
Abstract
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by the fact that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
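The CQF recipe described in the abstract (train a binary classifier, score each pretraining document, keep the top-scoring ones) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the hashed bag-of-words features, the NumPy logistic regression, the toy documents, and the names `featurize`, `train_classifier`, `score`, and `cqf_select` are all assumptions made for the example.

```python
import zlib
import numpy as np

def featurize(docs, dim=64):
    """Hashed bag-of-words counts (an illustrative feature choice, not from the paper)."""
    X = np.zeros((len(docs), dim))
    for i, doc in enumerate(docs):
        for tok in doc.lower().split():
            # crc32 gives a hash that is stable across Python processes.
            X[i, zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return X

def train_classifier(X_hq, X_pre, steps=500, lr=0.5):
    """Logistic regression by gradient descent: label 1 = HQ set, label 0 = pretraining data."""
    X = np.vstack([X_hq, X_pre])
    y = np.concatenate([np.ones(len(X_hq)), np.zeros(len(X_pre))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

def score(X, w, b):
    """Classifier probability that a document comes from the HQ set."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def cqf_select(docs, scores, keep_fraction):
    """Keep the top `keep_fraction` of documents by score, preserving original order."""
    k = max(1, int(round(keep_fraction * len(docs))))
    top = np.argsort(-np.asarray(scores))[:k]
    return [docs[i] for i in sorted(top)]

# Toy data, invented for illustration only.
hq_docs = [
    "the theorem follows from the lemma",
    "we prove the bound by induction",
]
pretraining_docs = [
    "click here to win a free prize",
    "the proof of the lemma uses induction",
    "buy cheap watches online now",
    "free shipping on all orders today",
]

w, b = train_classifier(featurize(hq_docs), featurize(pretraining_docs))
scores = score(featurize(pretraining_docs), w, b)
kept = cqf_select(pretraining_docs, scores, keep_fraction=0.5)
```

Note that the selection step only sees the ranking induced by the classifier score, so any property of that score (for instance, how it trades off the two classes) carries over directly to what gets retained.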
