Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

TL;DR

Meta CLIP 2 tackles the problem of scaling CLIP to worldwide web-scale multimodal data by introducing a worldwide metadata pipeline, per-language curation, and a training framework that scales the number of seen pairs and studies the minimal model capacity required. The approach demonstrates that English and non-English data can benefit each other when scaled together, breaking the curse of multilinguality with ViT-H/14 and achieving state-of-the-art multilingual retrieval and strong English transfer without translation or architecture changes. Key contributions include independent per-language metadata, a language-aware curation algorithm with tail-balanced sampling, and a scalable, open framework that preserves cultural diversity and applies broadly to vision-language tasks. Overall, this work advances foundation models by enabling native-language supervision and worldwide data coverage, with practical implications for multilingual AI systems and downstream MLLMs.

Abstract

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting uses that range from zero-shot classification and retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP is worse than that of its English-only counterpart, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges and present a recipe enabling mutual benefits from English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2% and XM3600 with 64.3% on image-to-text retrieval.

Paper Structure

This paper contains 27 sections, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: (Left) CLIP training suffers from the curse of multilinguality: the English performance of a CLIP model trained on worldwide (i.e., English + non-English), billion-scale data is worse than that of its English-only counterpart, even when applying our recipe to ViT-L/14; scaling to ViT-H/14 allows non-English data to help English-only CLIP. (Right) English data also helps non-English CLIP.
  • Figure 2: Overview of Meta CLIP 2 recipe: scaling CLIP data and training to worldwide scope.
  • Figure 3: Few-shot geo-localization accuracy on cultural diversity benchmarks.
  • Figure 4: Alignment and uniformity scores (Wang and Isola, 2020) calculated on our collected 5k holdout data; WW indicates worldwide data.
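
The alignment and uniformity scores named in the Figure 4 caption are the hypersphere metrics of Wang and Isola (2020): alignment is the mean distance between the embeddings of positive pairs, and uniformity is the log of the mean Gaussian potential over all pairs of embeddings. Lower is better for both. The following is a minimal sketch of those two definitions, not the paper's evaluation code. It assumes L2-normalized embeddings and the exponents alpha = 2 and t = 2; the function names and the toy vectors are hypothetical, and how the scores are computed on the 5k holdout set is not stated on this page.

```python
import math
from itertools import combinations


def l2_normalize(vec):
    """Scale a vector to unit length so it lies on the hypersphere."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment(image_embs, text_embs, alpha=2.0):
    """Mean of ||f(x) - f(y)||^alpha over positive (image, text) pairs.

    image_embs[i] and text_embs[i] are assumed to form a positive pair.
    """
    dists = [
        math.sqrt(squared_distance(u, v)) ** alpha
        for u, v in zip(image_embs, text_embs)
    ]
    return sum(dists) / len(dists)


def uniformity(embs, t=2.0):
    """Log of the mean Gaussian potential exp(-t * ||u - v||^2) over all
    distinct pairs of embeddings from a single modality."""
    potentials = [
        math.exp(-t * squared_distance(u, v))
        for u, v in combinations(embs, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Toy data (hypothetical): three 2-D image embeddings and matching texts.
images = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]]
texts = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]]

align_score = alignment(images, texts)
uniform_score = uniformity(images)
print(f"alignment={align_score:.4f} uniformity={uniform_score:.4f}")
```

In this toy case every text embedding coincides with its image embedding, so the alignment score is zero. Embeddings that spread more evenly over the sphere give a more negative uniformity score.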