Data curation via joint example selection further accelerates multimodal learning

Talfan Evans, Nikhil Parthasarathy, Hamza Merzic, Olivier J. Henaff

TL;DR

The paper tackles the data-efficiency bottleneck in multimodal pretraining by introducing JEST, a batch-level data curation method that optimizes learnability across whole batches using a pretrained reference model. By decomposing batch loss into per-example terms, JEST uses a sequential Gibbs-like sampling strategy to assemble highly learnable sub-batches, and it employs online model approximation with multi-resolution training (Flexi-JEST) to keep scoring overhead practical. The results show significant improvements in training efficiency, achieving state-of-the-art performance with substantially fewer iterations and FLOPs, and reveal that data-quality bootstrapping—guiding large-scale training with small, well-curated references—can robustly enhance generalization. Collectively, the approach demonstrates that steering the data distribution online, rather than relying solely on static curated datasets, provides a powerful lever for scalable, multimodal foundation-model learning, with potential to simplify data curation pipelines and guide future scaling laws.
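As a rough illustration of the selection procedure described above, here is a minimal NumPy sketch: it scores a super-batch with pairwise contrastive losses from the learner and a reference model, then assembles a sub-batch in chunks by sampling candidates in proportion to their conditional learnability given what has already been picked. The function names, the chunking scheme, and the softmax sampling are illustrative assumptions on our part, not the paper's implementation.

```python
import numpy as np

def sigmoid_pairwise_losses(img_emb, txt_emb, t=1.0, b=0.0):
    """Per-pair sigmoid contrastive losses over a batch of embeddings.

    Entry (i, j) is the loss of image i paired with text j, treated as a
    positive pair when i == j and as a negative pair otherwise.
    """
    logits = t * img_emb @ txt_emb.T + b
    labels = 2.0 * np.eye(len(img_emb)) - 1.0    # +1 on the diagonal, -1 off-diagonal
    return np.logaddexp(0.0, -labels * logits)   # numerically stable softplus(-labels * logits)

def joint_example_selection(learner_losses, ref_losses, n_select, n_chunks=16, seed=0):
    """Chunked, Gibbs-like selection of a learnable sub-batch.

    learner_losses, ref_losses: [N, N] pairwise loss matrices from the learner
    and a pretrained reference model on the same super-batch. The learnability
    of a sub-batch is its learner loss minus its reference loss.
    """
    rng = np.random.default_rng(seed)
    learnability = learner_losses - ref_losses   # [N, N]
    n = learnability.shape[0]
    selected = []
    chunk = max(1, n_select // n_chunks)
    for _ in range(n_chunks):
        remaining = np.setdiff1d(np.arange(n), selected)
        # Conditional score of each candidate: its own (diagonal) learnability
        # plus its interaction with everything already selected.
        cond = learnability[remaining, remaining].copy()
        if selected:
            cond += learnability[np.ix_(remaining, selected)].sum(axis=1)
            cond += learnability[np.ix_(selected, remaining)].sum(axis=0)
        probs = np.exp(cond - cond.max())
        probs /= probs.sum()
        picks = rng.choice(remaining, size=chunk, replace=False, p=probs)
        selected.extend(int(i) for i in picks)
    return np.array(selected)
```

For example, given learner and reference embeddings for a 1024-example super-batch, `joint_example_selection(sigmoid_pairwise_losses(li, lt), sigmoid_pairwise_losses(ri, rt), n_select=256)` would keep the quarter of the super-batch judged most learnable.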

Abstract

Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
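One way to read the abstract's learnability criterion (our notation and our reading; the paper's exact definition may differ) is as the gap between the learner's and the reference model's contrastive loss on a candidate sub-batch $\mathcal{B}$, where the loss itself decomposes into per-pair terms:

$$s^{\mathrm{learn}}(\mathcal{B} \mid \theta, \theta^{*}) = \ell(\mathcal{B} \mid \theta) - \ell(\mathcal{B} \mid \theta^{*}), \qquad \ell(\mathcal{B} \mid \theta) = \sum_{i \in \mathcal{B}} \sum_{j \in \mathcal{B}} \ell_{ij}(\theta).$$

Because the per-pair terms $\ell_{ij}$ couple every image in the sub-batch to every text, the score of a batch depends on interactions among its members, which is why jointly selected batches can outperform batches assembled by ranking examples one at a time.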

Paper Structure

This paper contains 22 sections, 4 equations, 10 figures, 5 tables, and 3 algorithms.

Figures (10)

  • Figure 1: Joint Example Selection accelerates multimodal pretraining. Our JEST/JEST++ methods bootstrap from small, strongly curated datasets (Webli-curated/Webli-curated++) to actively curate web-scale datasets. Flexi-JEST++ uses variable patch sizing to reduce the cost of curation. Left: Training with JEST matches the performance of the uniform 40B SigLIP baseline with up to 13$\times$ fewer iterations. Middle: Even when accounting for the cost of scoring, our best variant is almost 10$\times$ more FLOP efficient. Right: Comparison of JEST++/Flexi-JEST++ (green) to prior methods (grey). Average accuracy is computed across 8 downstream tasks (left, middle; see Table \ref{tab:appendix_table_2}), or ImageNet and COCO (right).
  • Figure 2: Joint example selection yields more learnable batches. Left: The learnability of a batch is highly structured and non-diagonal. Middle: Joint example selection quickly discovers sub-batches with high learnability, on par with brute-force Gibbs sampling. Right: The learnability of sampled batches improves with higher filtering ratios (i.e. selecting from larger super-batches).
  • Figure 3: Joint example selection accelerates multimodal learning. Left: Training on the most learnable sub-batch selected from super-batches that are 2$\times$, 5$\times$, or 10$\times$ larger significantly accelerates multimodal learning. Middle: Jointly prioritizing learnable batches yields significantly better results than simply prioritizing individual examples. Right: Joint example selection also improves easy reference prioritization, although learnability scales better with more aggressive filtering.
  • Figure 4: Efficient scoring and multi-resolution training. Left: In scoring large super-batches with the learner and reference models, JEST incurs a large computational cost per iteration. By caching the fixed reference model scores in the dataset, this overhead can be cut in half (see the sketch after this figure list). Efficient scoring and multi-resolution training further reduce this to be comparable to standard IID training. Middle: Flexi-JEST improves the total FLOP-efficiency of JEST over standard IID training. Right: Multi-resolution training improves Flexi-JEST more than standard IID training. Without multi-resolution training (left-most point), Flexi-JEST underperforms the IID baseline (due to an untrained approximate model), but quickly improves with even a small amount of co-training (25%).
  • Figure 5: Scaling strong data curation improves JEST performance. Left: We compare JEST performance vs. reference model performance (relative to the uniform baseline) for 4 curation types: 'weak' curation with image-text alignment (ITA), 'moderate' curation with ITA or text-quality (TQ), and 'strong' curation (using a combination of TQ, ITA, and additional image-quality (IQ)). Right: We use our best reference dataset (TQ+ITA+IQ) and evaluate JEST vs. reference performance while varying the number of examples seen during reference pretraining. There is a strong correlation between additional reference training and JEST performance that saturates after 1B examples seen. By scaling strong data curation to a 600M dataset, this saturation is broken, as both reference model and JEST performance improve for the 1B and 2B reference training runs.
  • ...and 5 more figures
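Figure 4 (left) mentions cutting the scoring overhead in half by caching the fixed reference model's scores in the dataset. Below is a hypothetical sketch of how such caching could plug into the selection step, under the assumption that the cached quantity is a per-example reference embedding computed once offline; the paper may cache a different quantity, and the function and layout here are our own illustration.

```python
import numpy as np

def cached_learnability(learner_img, learner_txt, ref_img_cached, ref_txt_cached, t=1.0):
    """Learnability matrix for a super-batch, reusing cached reference embeddings.

    ref_*_cached are embeddings produced once, offline, by the frozen reference
    model and stored alongside each example, so the reference model never has to
    run during training; only the learner embeds the super-batch online.
    """
    def pairwise_losses(img, txt):
        logits = t * img @ txt.T
        labels = 2.0 * np.eye(len(img)) - 1.0       # +1 for matched pairs, -1 otherwise
        return np.logaddexp(0.0, -labels * logits)  # per-pair sigmoid contrastive loss

    return pairwise_losses(learner_img, learner_txt) - pairwise_losses(ref_img_cached, ref_txt_cached)

# Toy usage with random embeddings standing in for real model outputs.
rng = np.random.default_rng(0)
li, lt, ri, rt = (rng.normal(size=(256, 64)) for _ in range(4))
scores = cached_learnability(li, lt, ri, rt)  # [256, 256] learnability matrix
```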