Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter, Yair Carmon

TL;DR

FLYT reframes vision-language data curation for CLIP pretraining as a learnable data-weighting problem, using gradient signals from downstream tasks to estimate the usefulness of each example. The framework introduces a scoring model $q_\phi$ that ingests multiple data-quality signals $\Psi$ and, via a downstream feedback loop, learns which data points contribute most to downstream performance. Mixing-FLYT (M-FLYT) unifies multiple scoring signals into a single score, while Soft Cap Sampling (SCS) samples data from the resulting distribution with a repetition penalty to prevent over-representation. On the DataComp medium-scale benchmark, M-FLYT with SCS achieves $40.1\%$ ImageNet zero-shot accuracy and $37.7\%$ average accuracy across 38 tasks, improving on prior public-resource methods by $5.5\%$ and $0.4\%$ respectively; small-scale results corroborate the gains. The approach demonstrates a data-centric path to improving large-scale vision-language pretraining, with potential extensions to larger data pools and language-model filtering.

Abstract

We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks' training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium-scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that, like ours, use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.
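The abstract describes SCS only at a high level: sample from the per-example distribution while a repetition penalty prevents over-representation. The sketch below is one plausible reading, not the authors' specification. It assumes that examples are drawn in rounds without replacement and that each drawn example has a fixed `penalty` subtracted from its log-score before the next round. The function name `soft_cap_sample` and the parameters `penalty`, `batch_size`, and `seed` are hypothetical.

```python
import numpy as np

def soft_cap_sample(logits, target_size, penalty, batch_size, seed=0):
    """Toy soft-cap sampler (assumed mechanics, see lead-in).

    Returns an integer array: how many times each example was selected.
    """
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=float).copy()
    counts = np.zeros(len(logits), dtype=int)
    remaining = target_size
    while remaining > 0:
        k = min(batch_size, remaining, len(logits))
        # Softmax over the current (penalized) log-scores.
        p = np.exp(logits - logits.max())
        p /= p.sum()
        # Draw k distinct examples this round. NumPy requires at least k
        # non-zero probabilities, so extremely spread-out logits would fail.
        picked = rng.choice(len(logits), size=k, replace=False, p=p)
        counts[picked] += 1
        # Repetition penalty: a selected example becomes less likely next round.
        logits[picked] -= penalty
        remaining -= k
    return counts

# Usage on a small synthetic pool (all values are made up for illustration).
pool_logits = np.random.default_rng(1).normal(size=1000)
counts = soft_cap_sample(pool_logits, target_size=2000, penalty=0.3, batch_size=100)
print(counts.sum(), counts.max())
```

Under this reading, `penalty = 0` reduces to plain repeated sampling from the original distribution, while a very large penalty makes a second selection of the same example unlikely until most of the pool has been used.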

Paper Structure

This paper contains 39 sections, 26 equations, 8 figures, 15 tables, 2 algorithms.

Figures (8)

  • Figure 1: In each FLYT training loop, the scoring model takes features extracted from a batch of upstream data and generates a score for each example. These scores are converted to weights via softmax. A reference model processes the upstream batch, and the resulting embeddings, together with the weights, are used to compute a weighted CLIP loss. The reference model is then updated using gradients from this loss. Next, this updated reference model processes downstream data to compute a downstream loss, which produces a gradient signal that passes through the updated reference model all the way to the scoring model parameters.
  • Figure 2: Histogram of example repetitions using SCS on the probabilities produced by M-FLYT. See the sampling results subsection for more details.
  • Figure 3: Comparison of sampling strategies (SCS, HCS, threshold filtering, and No Cap) showing their effect on ImageNet accuracy (left) and average accuracy (right). SCS was tested with $\alpha$ values from 0.1 to 0.6, HCS with $\beta$ values from 5 to 25, and standard threshold filtering with values from 5% to 25%.
  • Figure 4: Comparing SCS to hard threshold filtering on the small-scale DataComp benchmark. SCS was tested with $\alpha$ values from 0.15 to 1, and standard threshold filtering with values from 10% to 30%.
  • Figure 5: M-FLYT top scoring examples.
  • ...and 3 more figures
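
The training loop in the Figure 1 caption can be illustrated with a deliberately small sketch. It is not the authors' implementation: the reference model here is a linear regressor with a squared loss standing in for CLIP and the weighted CLIP loss, the scoring model is linear in the features, and the reference update is a single plain gradient step. Those simplifications, and the names `flyt_step`, `lr_ref`, `feats`, `x_up`, and `x_down`, are assumptions for illustration. Under them, the gradient that reaches the scoring parameters can be written by hand with the chain rule.

```python
import numpy as np

def softmax(s):
    z = np.exp(s - s.max())
    return z / z.sum()

def flyt_step(phi, theta, feats, x_up, y_up, x_down, y_down, lr_ref=0.1):
    """One toy FLYT step: returns (updated theta, downstream loss, grad wrt phi)."""
    # 1. Scoring model: one score per upstream example, softmax over the batch.
    w = softmax(feats @ phi)                                  # shape (B,)
    # 2. Weighted upstream loss sum_i w_i * 0.5 * (x_i . theta - y_i)^2.
    #    g[i] is the gradient of example i's loss with respect to theta.
    g = (x_up @ theta - y_up)[:, None] * x_up                 # shape (B, D)
    # 3. Update the reference model with the weighted gradient.
    theta_new = theta - lr_ref * (w @ g)                      # shape (D,)
    # 4. Downstream loss evaluated on the updated reference model.
    resid_down = x_down @ theta_new - y_down
    loss_down = 0.5 * np.mean(resid_down ** 2)
    grad_theta_new = x_down.T @ resid_down / len(y_down)      # shape (D,)
    # 5. Backpropagate through the update: d theta_new / d w_i = -lr_ref * g[i].
    dloss_dw = -lr_ref * (g @ grad_theta_new)                 # shape (B,)
    # 6. Through the softmax, then through the linear scoring model.
    dloss_dscore = w * (dloss_dw - w @ dloss_dw)
    grad_phi = feats.T @ dloss_dscore                         # shape (F,)
    return theta_new, loss_down, grad_phi

# Usage on random data (all values are synthetic).
rng = np.random.default_rng(0)
B, D, F, M = 8, 3, 2, 5
feats = rng.normal(size=(B, F))
x_up, y_up = rng.normal(size=(B, D)), rng.normal(size=B)
x_down, y_down = rng.normal(size=(M, D)), rng.normal(size=M)
phi, theta = np.zeros(F), np.zeros(D)

theta, loss, grad_phi = flyt_step(phi, theta, feats, x_up, y_up, x_down, y_down)
phi = phi - 0.5 * grad_phi   # gradient step on the scoring model
print(loss, grad_phi)
```

In this toy setting, step 5 shows that an example's weight is pushed up when its upstream gradient `g[i]` aligns with the downstream gradient, since taking a larger step along `g[i]` then lowers the downstream loss. A real implementation would rely on automatic differentiation rather than a hand-derived gradient.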