MoDE: CLIP Data Experts via Clustering

Jiawei Ma, Po-Yao Huang, Saining Xie, Shang-Wen Li, Luke Zettlemoyer, Shih-Fu Chang, Wen-Tau Yih, Hu Xu

TL;DR

CLIP pretraining on web-crawled image–caption data is hampered by noisy negatives that arise when captions are misaligned with image content. MoDE tackles this by learning a set of data experts, each trained on one semantically coherent cluster of captions discovered via a two-step clustering process, and by aggregating expert outputs at inference time based on task metadata. This data-expert ensemble improves zero-shot classification and retrieval across scales while reducing training cost, since the experts can be trained asynchronously. The approach enables scalable, continual CLIP pretraining and offers a framework for task-conditioned adaptation without parameter updates.

Abstract

The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, which makes it less sensitive to false-negative noise from other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
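The inference-time ensembling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`routing_weights`, `ensemble_logits`), the use of cosine similarity, the softmax with a `temperature` parameter, and the rule of summing probability mass per expert are all assumptions made for illustration. The only parts taken from the text are that weights come from the similarity between task metadata (for example, class-name embeddings) and fine-grained cluster centers, and that each fine-grained center belongs to one coarse-grained data expert.

```python
import numpy as np

def routing_weights(metadata_emb, fine_centers, fine_to_coarse, num_experts,
                    temperature=0.1):
    """Map metadata-to-center similarity to one weight per data expert.

    metadata_emb:   (num_classes, dim) embeddings of task metadata
    fine_centers:   (num_fine, dim) fine-grained cluster centers
    fine_to_coarse: (num_fine,) index of the expert owning each center
    """
    # Cosine similarity between every metadata entry and every fine center.
    m = metadata_emb / np.linalg.norm(metadata_emb, axis=1, keepdims=True)
    s = fine_centers / np.linalg.norm(fine_centers, axis=1, keepdims=True)
    sim = m @ s.T  # (num_classes, num_fine)

    # Assumed normalization: softmax over fine-grained centers.
    z = sim / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p = p / p.sum(axis=1, keepdims=True)

    # Assumed aggregation: an expert's weight is the total probability
    # mass of the fine-grained centers assigned to it.
    weights = np.zeros((metadata_emb.shape[0], num_experts))
    for fine_idx, expert_idx in enumerate(fine_to_coarse):
        weights[:, expert_idx] += p[:, fine_idx]
    return weights  # each row sums to 1

def ensemble_logits(expert_logits, weights):
    """Weighted sum of per-expert logits.

    expert_logits: (num_experts, num_images, num_classes)
    weights:       (num_classes, num_experts)
    """
    return np.einsum("eic,ce->ic", expert_logits, weights)
```

With this design the weights are computed once per task from its metadata, so no expert parameters are updated at inference time; only the mixing of expert outputs changes from task to task.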

Paper Structure (30 sections, 7 equations, 7 figures, 17 tables)


Figures (7)

  • Figure 1: For an image-caption pair, the caption may describe only part of the visual content or even be unrelated to it, and such noise unavoidably degrades the quality of the negative examples used to train a single model. We propose to uncover clusters in the training data, where 1) pairs with similar images but different captions are assigned to different clusters and 2) the samples in each cluster have related meanings, and to learn a Data Expert for each cluster. These experts are then selectively ensembled for inference.
  • Figure 2: Framework of MoDE via clustering. (Left) We perform a two-step clustering on captions to decide the clusters / conditions for data experts. The colored scatter plots are fine-grained clusters and the circles are clusters at the coarse-grained level. (Right) Each coarse-grained cluster ($c$) conditions the learning of one data expert $f(\cdot | c)$, and all data experts (colored boxes) are learned asynchronously. For inference, the similarity between task metadata and fine-grained cluster centers ($\{s\}$) is used to decide the routing of data experts. To keep the training cost reasonable, all data experts can be initialized with a model partially trained on all data without clustering (omitted for simplicity).
  • Figure 3: Average accuracy on the CLIP benchmark as the number of data expert models in MoDE increases (Pretrain set: 2.5B pairs).
  • Figure 4: Summary of average accuracy on the CLIP benchmark and pretraining cost (GPU-Hours). The diameter is proportional to the model size, and different approaches are color-coded.
  • Figure 5: Ablation on # of clusters in Step 1.
  • ...and 2 more figures
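
The two-step clustering in the Figure 2 caption can be sketched as follows, under stated assumptions. The caption only says that captions are first grouped into fine-grained clusters, which are then grouped at a coarse-grained level, with one data expert per coarse cluster. The plain k-means routine, the Euclidean distance, the function names (`kmeans`, `two_step_clustering`), and the assumption that captions are already embedded as vectors are illustrative choices, not details confirmed by this page.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means on the rows of x; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    # Recompute labels so they match the final centers.
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return centers, dist.argmin(axis=1)

def two_step_clustering(caption_emb, num_fine, num_coarse, seed=0):
    """Step 1: fine-grained clusters of captions.
    Step 2: group the fine-grained centers into coarse clusters (experts)."""
    fine_centers, fine_labels = kmeans(caption_emb, num_fine, seed=seed)
    _, fine_to_coarse = kmeans(fine_centers, num_coarse, seed=seed)
    # A training pair goes to the expert that owns its fine-grained cluster.
    sample_to_expert = fine_to_coarse[fine_labels]
    return fine_centers, fine_to_coarse, sample_to_expert
```

In this sketch `sample_to_expert` decides which pairs each data expert trains on, while `fine_centers` and `fine_to_coarse` are kept so that task metadata can later be compared against the fine-grained centers for routing.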