Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data

Lei Zhang, Fangxun Shu, Tianyang Liu, Sucheng Ren, Hao Jiang, Cihang Xie

TL;DR

Vision-language pre-training on web-scale data is limited by the variable quality of image-text pairs. The authors propose COCO-HF, a human-knowledge-driven filtering pipeline that collects diverse image-caption data, obtains structured human preferences on alignment, trains a reward model to mimic those preferences, and uses it to compress large corpora while preserving or improving downstream zero-shot performance. Across CC3M, CC12M, and LAION-400M, the approach reduces the data substantially (often by close to an order of magnitude) with notable gains in image captioning and retrieval, outperforming both CLIP/BLIP-based filtering and full-dataset baselines. The work shows that incorporating human judgment into data curation can improve data efficiency and alignment in vision-language pre-training, offering a practical path to scalable, high-quality multimodal corpora.

Abstract

The increasing availability of image-text pairs has largely fueled the rapid advancement in vision-language foundation models. However, the vast scale of these datasets inevitably introduces significant variability in data quality, which can adversely affect model performance. This highlights the critical role of data filtering, not only to enhance training efficiency but also to improve overall data quality. Existing methods typically rely on metrics such as CLIP Score and BLIP Score, which are derived from pre-trained models. However, these models are often trained on uncurated, noisy datasets, which can perpetuate errors and misalignments in the filtered dataset. We present a novel algorithm that incorporates human knowledge on image-text alignment to guide the filtering of vast corpora of web-crawled image-text data into a compact and high-quality form. To systematically capture human preferences on image-text alignment, we collect a diverse image-text dataset where each image is associated with multiple captions from various sources, and establish a comprehensive set of both subjective and objective criteria to guide labelers' alignment assessments. Additionally, we train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment. The resulting reward model can thus act as a human-like referee to filter image-text pairs. Extensive experiments demonstrate that we can maintain, and sometimes even improve, model performance while compressing the image-text datasets by up to ~90%. An impressive example is that, by aggressively reducing the total number of training samples from 130M to only 15.5M, our BLIP-B/16 models consistently show an average improvement of 2.9% on retrieval tasks and 11.5% on captioning tasks compared to full-size-dataset counterparts.
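
The abstract says a reward model is trained on human-preference annotations but does not state the training objective. As a minimal sketch, assuming a pairwise logistic (Bradley-Terry style) loss, which is a common choice for preference-based reward models and not confirmed by the text above, the objective for one annotated comparison could look like this. The function names and the scalar scores are hypothetical; a real reward model would produce the scores from an image and a caption.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Assumed objective: -log(sigmoid(s_preferred - s_rejected)).

    The loss is small when the caption the annotator preferred receives
    the higher score, and grows when the ordering is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(score_pairs):
    """Average the loss over (preferred_score, rejected_score) tuples."""
    if not score_pairs:
        raise ValueError("score_pairs must not be empty")
    total = sum(pairwise_preference_loss(w, l) for w, l in score_pairs)
    return total / len(score_pairs)
```

With equal scores the loss is log 2, the value of a model that cannot tell the two captions apart; minimizing it pushes the preferred caption's score above the rejected one.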

Paper Structure (30 sections, 1 equation, 4 figures, 11 tables)

Figures (4)

  • Figure 1: Our method outperforms training on the full-size dataset on various downstream tasks with BLIP-B/16. This training set consists of CC3M, CC12M, and a subset of LAION-400M. We reduce the training sample size from 130M to 15.5M (i.e., $\sim$9$\times$ smaller).
  • Figure 2: A diagram illustrating the three steps of our method. We first curate an image-text dataset to collect human knowledge on alignment in Step 1. Then we train a reward model to predict human preference in Step 2. The reward model functions as a human-like referee to filter misaligned image-text pairs in Step 3.
  • Figure 3: Annotation interface for image caption evaluation. Annotators compare two captions (A and B) for a given image across four criteria (Accuracy, Completeness, Vividness, and Context) respectively.
  • Figure 4: Performance at different compression ratios when filtering by CLIP Score, BLIP Score, and our method. Our method achieves the best performance at all scales. Due to the inherent size of the dataset, an excessively low compression rate can harm performance.
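
The filtering step in Figure 2 and the compression ratios in Figures 1 and 4 can be illustrated with a short sketch. It assumes that filtering keeps the highest-scoring fraction of pairs, which the captions do not spell out; `filter_by_score` and `compression_stats` are hypothetical helper names, and the scores stand in for the output of whichever scorer is used (reward model, CLIP Score, or BLIP Score).

```python
def filter_by_score(pairs, scores, keep_fraction):
    """Keep the top `keep_fraction` of pairs by score, in original order."""
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(pairs) * keep_fraction))
    # Rank indices from highest to lowest score, then restore input order.
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [pairs[i] for i in kept]


def compression_stats(original_size, kept_size):
    """Return (compression factor, fraction of samples removed)."""
    return original_size / kept_size, 1 - kept_size / original_size


# Figures from the Figure 1 caption: 130M samples reduced to 15.5M.
factor, removed = compression_stats(130e6, 15.5e6)
print(f"{factor:.1f}x smaller, {removed:.0%} of samples removed")
```

For the numbers quoted in Figure 1, 130M / 15.5M is about 8.4, which the caption rounds to roughly 9×, and about 88% of the samples are removed, consistent with the "up to ~90%" compression stated in the abstract.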