
CiT: Curation in Training for Effective Vision-Language Data

Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

TL;DR

CiT tackles the prohibitively high cost of large-scale vision-language pretraining by introducing a dynamic data-curation mechanism that operates alongside model training. It uses text embeddings as a data proxy, grounded in downstream task metadata, to select training data in an outer loop, while an inner loop optimizes a contrastive image-to-text objective with a frozen vision encoder. Across cleaned and raw web-scale datasets (including LAION400M and a raw crawl), CiT delivers substantial speedups (over an order of magnitude in some settings) and competitive or superior zero-shot accuracy compared to LiT and OpenCLIP, even when training on noisy or multilingual data. The approach reduces reliance on offline data-processing pipelines, scales to large pools of raw data, and offers practical gains for researchers and practitioners seeking cost-efficient vision-language pretraining.
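The inner-loop objective mentioned above can be illustrated with a minimal sketch of a one-directional (image-to-text) contrastive loss. This is not the authors' implementation: the function names, the temperature value of 0.07, and the use of plain Python lists are assumptions made for illustration. The image embeddings are treated as fixed inputs, which mirrors the idea of a frozen vision encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def image_to_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Mean cross-entropy over images, where image i should match text i.

    image_embs: embeddings from a (frozen) vision encoder, one per pair.
    text_embs:  embeddings from the trainable text encoder, one per pair.
    temperature: hypothetical value, chosen only for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(images):
        # Similarity of image i to every text in the batch.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the matching text.
        total += log_z - logits[i]
    return total / len(images)


aligned = image_to_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
swapped = image_to_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, correctly paired embeddings give a loss close to zero, while swapping the captions gives a much larger loss, which is the signal the text encoder would be trained on.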

Abstract

Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring the similarity between their text embeddings and the embeddings of the metadata, and trains on the selected data. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
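The curation step described in the abstract can be sketched as a similarity filter over text embeddings. The sketch below is an illustration under stated assumptions, not the paper's algorithm: the function names, the rule of keeping a pair when its maximum cosine similarity to any metadata embedding reaches a threshold, and the toy 2-D embeddings are all hypothetical. In practice the embeddings would come from the text encoder being trained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def curate(pool_text_embs, metadata_embs, threshold):
    """Return indices of pool items relevant to the task metadata.

    An item is kept when its best cosine similarity to any metadata
    embedding (e.g. an embedded class name) is at least `threshold`.
    The max-over-metadata rule is an assumption of this sketch.
    """
    keep = []
    for idx, emb in enumerate(pool_text_embs):
        best = max(cosine(emb, m) for m in metadata_embs)
        if best >= threshold:
            keep.append(idx)
    return keep


# Toy 2-D embeddings standing in for text-encoder outputs.
metadata = [[1.0, 0.0], [0.0, 1.0]]            # e.g. two class names
pool = [[0.9, 0.1], [-1.0, -1.0], [0.1, 0.8], [-0.5, 0.2]]
selected = curate(pool, metadata, threshold=0.7)
print(selected)  # [0, 2]
```

Because the text encoder changes during training, re-running a filter like this in the outer loop would select different data over time, which is what makes the curation dynamic rather than a one-off offline step.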

Paper Structure (57 sections, 2 equations, 3 figures, 15 tables, 4 algorithms)

This paper contains 57 sections, 2 equations, 3 figures, 15 tables, 4 algorithms.

Figures (3)

  • Figure 1: A conceptual illustration of CLIP training vs. CiT. Vanilla CLIP training uses static data from offline human filtering (e.g., cleaned YFCC15M or WIT400M; Radford et al., 2021) and optimizes the model. In contrast, CiT incorporates dynamic data curation into training in two loops: (i) an outer curation loop improving data (for downstream tasks) given the current model; (ii) an inner loop optimizing the model given the curated data. The trained text model connects the loops by providing embeddings for curation.
  • Figure 2: CiT provides a $>5\times$ speedup and a +3.4% accuracy gain over LiT (Zhai et al., 2022) on AugReg ViT-B/32 vision encoders. Training data is YFCC15M. Models are evaluated at 6 evenly sampled iterations.
  • Figure 3: Ratio of curation under different thresholds $t$. CiT uses data broadly at first and curates more towards the end of training.