Concept-Aware Batch Sampling Improves Language-Image Pretraining

Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge

TL;DR

This work introduces Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. It proposes two variants: Diversity Maximization (CABS-DM), which curates batches with a broad coverage of available concepts, and Frequency Maximization (CABS-FM), which curates batches with high object multiplicity.

Abstract

What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
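The idea of building a batch on-the-fly from concept annotations can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring functions, so `frequency_score`, `diversity_score`, `select_sub_batch`, `unique_concepts`, and the toy `super_batch` are all hypothetical names and choices made here for illustration. The sketch assumes each sample carries a set of concept tags, draws a larger candidate pool, and greedily keeps the samples that a pluggable scoring function ranks highest.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

# Hypothetical sample representation: the set of concepts tagged on one
# image-text pair (an assumption, not the DataConcept schema).
Sample = Set[str]
# A scoring function sees a candidate and the concept counts of the
# sub-batch selected so far; a higher score means higher priority.
ScoreFn = Callable[[Sample, Counter], float]


def frequency_score(sample: Sample, counts: Counter) -> float:
    """Assumed FM-style score: prefer samples tagged with many concepts."""
    return float(len(sample))


def diversity_score(sample: Sample, counts: Counter) -> float:
    """Assumed DM-style score: reward concepts still rare in the sub-batch."""
    return sum(1.0 / (1.0 + counts[c]) for c in sample)


def select_sub_batch(super_batch: Sequence[Sample], k: int,
                     score_fn: ScoreFn) -> List[int]:
    """Greedily pick k sample indices from a larger candidate pool."""
    if not 0 <= k <= len(super_batch):
        raise ValueError("k must be between 0 and len(super_batch)")
    counts: Counter = Counter()
    remaining = list(range(len(super_batch)))
    chosen: List[int] = []
    for _ in range(k):
        # Ties resolve to the earliest remaining index.
        best = max(remaining, key=lambda i: score_fn(super_batch[i], counts))
        chosen.append(best)
        remaining.remove(best)
        counts.update(super_batch[best])
    return chosen


def unique_concepts(super_batch: Sequence[Sample],
                    indices: Sequence[int]) -> Set[str]:
    """Collect the distinct concepts covered by the selected samples."""
    covered: Set[str] = set()
    for i in indices:
        covered |= super_batch[i]
    return covered


# Toy candidate pool with a skew towards "dog".
super_batch = [
    {"dog"},
    {"dog", "ball", "grass"},
    {"dog"},
    {"cat"},
    {"dog", "person"},
    {"car"},
]

fm_idx = select_sub_batch(super_batch, 3, frequency_score)
dm_idx = select_sub_batch(super_batch, 3, diversity_score)
print("FM-style:", fm_idx, sorted(unique_concepts(super_batch, fm_idx)))
print("DM-style:", dm_idx, sorted(unique_concepts(super_batch, dm_idx)))
```

Swapping only the scoring function changes which samples survive. On this toy pool the multiplicity-based score keeps a third "dog" sample, while the rarity-weighted score replaces it with the "cat" sample and covers one more concept.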

Paper Structure

This paper contains 41 sections, 11 equations, 30 figures, 17 tables, 3 algorithms.

Figures (30)

  • Figure 1: Task-adaptive, steerable, Concept-Aware Batch Sampling (CABS). The per-sample concept multiplicities (left) of MSCOCO retrieval and ImageNet classification train sets depict their divergent distributional properties. By only modifying a simple scoring function, CABS can flexibly adapt to different target tasks (details in \ref{cabs-formulation-section}). Both our classification-optimized (CABS-DM, see \ref{sec:cabs-dm}) and retrieval-optimized (CABS-FM, see \ref{sec:cabs-fm}) variants outperform IID sampling by large margins, across several experimental configurations.
  • Figure 2: DataConcept. We start with images from DataComp \cite{gadre2023datacomp} and build a concept bank $\mathcal{V}$ by merging, deduplicating, and filtering various concept sources. In ① First-order tagging, we assign a preliminary list of concepts (from $\mathcal{V}$) to each sample. ② We then ground each concept in the image, removing noise in the initial candidates. ③ Lastly, we use a model to transform alt-texts into concept-aware captions.
  • Figure 3: Sub-batch compositions. CABS-DM induces a near-uniform concept frequency distribution, de-biasing the distributional skew induced by IID sampling. "Unique" indicates the total number of unique concepts in the sub-batch: CABS-DM incorporates nearly double the concepts in the curated sub-batch, compared to IID.
  • Figure 4: CABS with longer training (1.28B samples seen). Both CABS-DM and CABS-FM show a significant boost over IID for ViT-B-32-CLIP in both compute-constrained and data-constrained regimes. The grey dashed line marks the point where the compute-constrained regime shifts to the data-constrained regime under IID sampling.
  • Figure 5: Qualitative results with different RAM++ thresholds. While \cite{udandarao2024no} found 0.7 to be a suitable RAM++ threshold, we show qualitative examples across three different thresholds (0.7, 0.75, 0.8) on a much larger concept bank. We find the most suitable pool of concepts at the 0.75 confidence threshold.
  • ...and 25 more figures