CiT: Curation in Training for Effective Vision-Language Data
Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
TL;DR
CiT tackles the prohibitively high cost of large-scale vision-language pretraining by introducing a dynamic data-curation mechanism that operates alongside model training. It uses a text-embedding data proxy, grounded in downstream task metadata, to selectively curate training data in an outer loop while an inner loop optimizes a contrastive image-text objective with a frozen vision encoder. Across cleaned and raw web-scale datasets (including LAION400M and a raw crawl), CiT delivers substantial speedups (over an order of magnitude, especially on large raw pools) and competitive or superior zero-shot accuracy compared to LiT and OpenCLIP, even when training on noisy or multilingual data. The approach reduces the need for offline data-filtering pipelines, scales to large pools of raw data, and offers practical gains for researchers and practitioners seeking cost-efficient vision-language pretraining.
Abstract
Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web). CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternately selects relevant training data from the pool by measuring the similarity between their text embeddings and the embeddings of the metadata, and trains on the selected data. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
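The two-loop structure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`curate`, `cit_loop`), the max-over-metadata cosine-similarity rule, the default threshold of 0.5, and the `encode_text` and `train_step` callables are all assumptions made for illustration. The abstract states only that data is selected by the similarity between text embeddings and metadata embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)

def curate(text_emb, meta_emb, threshold=0.5):
    """Return indices of pool items whose text embedding is close to any metadata embedding.

    text_emb: (num_pairs, dim) embeddings of the captions in the pool.
    meta_emb: (num_meta, dim) embeddings of task metadata such as class names.
    threshold: hypothetical cut-off on cosine similarity (an assumption).
    """
    sims = l2_normalize(text_emb) @ l2_normalize(meta_emb).T  # (num_pairs, num_meta)
    best = sims.max(axis=1)  # best-matching metadata entry for each caption
    return np.nonzero(best >= threshold)[0]

def cit_loop(pool_texts, metadata, encode_text, train_step, rounds=2, threshold=0.5):
    """Alternate between curation (outer loop) and training (inner loop).

    encode_text and train_step are hypothetical callables standing in for the
    current text encoder and one contrastive update on a curated pair.
    """
    for _ in range(rounds):
        # Outer loop: re-embed with the current text encoder and select data.
        keep = curate(encode_text(pool_texts), encode_text(metadata), threshold)
        # Inner loop: consume the curated data.
        for i in keep:
            train_step(int(i))

# Toy usage with hand-written 2-D embeddings.
pool = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
meta = np.array([[1.0, 0.0]])
selected = curate(pool, meta, threshold=0.7)
```

Because the text encoder is updated by the inner loop and reused by the outer loop, the selection can change from round to round, which is what couples the data objective to training in this sketch.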
