Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining
Mikey Shechter, Yair Carmon
TL;DR
FLYT reframes vision-language data curation for CLIP pretraining as a learnable data-weighting problem, using gradient signals from downstream tasks to estimate the usefulness of each example. The framework introduces a scoring model $q_\phi$ that ingests multiple data-quality signals $\Psi$ and, via a downstream feedback loop, learns which data points contribute most to downstream performance. Mixing-FLYT (M-FLYT) unifies multiple scoring signals into a single score, while Soft Cap Sampling (SCS) samples data from the resulting distribution with a repetition penalty to prevent over-representation. On the DataComp medium-scale benchmark, M-FLYT with SCS achieves $40.1\%$ ImageNet zero-shot accuracy and $37.7\%$ average accuracy across 38 tasks, improving on prior public-resource methods by $5.5\%$ and $0.4\%$, respectively; small-scale results corroborate these gains. The approach demonstrates a data-centric path to improving large-scale vision-language pretraining, with potential extensions to larger data pools and language-model filtering.
Abstract
We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from the training sets of downstream tasks. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium-scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that, like ours, use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
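To make the idea of sampling with a repetition penalty concrete, the following is a minimal sketch of one way such a scheme could work. It is not the authors' implementation: the function name `soft_cap_sample`, the parameters `batch_size` and `penalty`, and the choice to draw each batch without replacement via the Gumbel top-k trick are all assumptions made for illustration. The abstract specifies only that examples are sampled from per-example probabilities and that repeats are discouraged by a penalty.

```python
import numpy as np


def soft_cap_sample(scores, n_total, batch_size, penalty, seed=0):
    """Hypothetical sketch of sampling with a repetition penalty.

    scores     : per-example logits (higher = more useful); softmax(scores)
                 is the sampling distribution.
    n_total    : number of examples (counting repeats) in the final dataset.
    batch_size : examples drawn per round, each at most once per round.
    penalty    : amount subtracted from an example's logit every time it is
                 drawn, which makes further repeats less likely.
    Returns an integer array with the number of copies of each example.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=np.float64).copy()
    n = scores.shape[0]
    counts = np.zeros(n, dtype=np.int64)
    remaining = n_total
    while remaining > 0:
        k = min(batch_size, remaining, n)
        # Gumbel top-k: perturbing logits with Gumbel noise and keeping the
        # k largest is equivalent to drawing k items without replacement
        # from softmax(scores).
        noisy = scores + rng.gumbel(size=n)
        chosen = np.argpartition(-noisy, k - 1)[:k]
        counts[chosen] += 1
        # Repetition penalty (assumed form): lower the logits of the
        # examples that were just drawn.
        scores[chosen] -= penalty
        remaining -= k
    return counts


if __name__ == "__main__":
    toy_scores = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0]
    print(soft_cap_sample(toy_scores, n_total=9, batch_size=2, penalty=1.0))
```

In this sketch, a larger `penalty` pushes the result toward a flatter selection with few repeats, while a penalty of zero reduces to repeated sampling from a fixed distribution, so a high-scoring example can dominate the dataset.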
