No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance

Vishaal Udandarao, Ameya Prabhu, Adhiraj Ghosh, Yash Sharma, Philip H. S. Torr, Adel Bibi, Samuel Albanie, Matthias Bethge

TL;DR

This work challenges the notion of zero-shot generalization in multimodal models by systematically analyzing how concept frequency in pretraining data governs downstream performance. It reveals a robust log-linear relationship: performance scales linearly with the log of a concept's pretraining frequency. The trend holds across classification, retrieval, and generation, and it persists both when controlling for similarity between pretraining and downstream data and when pretraining on purely synthetic data. The authors document long-tailed concept distributions, widespread image-text misalignment, and cross-dataset correlations, and they introduce the Let It Wag! benchmark to probe tail behavior. The findings imply that current zero-shot gains arise largely from data frequency rather than true generalization: linear improvements require exponentially more data, which motivates data-centric approaches. The authors also release extensive data artifacts and a tail-focused benchmark to accelerate further study of multimodal generalization in the long tail.

Abstract

Web-crawled pretraining datasets underlie the impressive "zero-shot" evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of "zero-shot" generalization is for such multimodal models, as it is not known to what extent their pretraining datasets encompass the downstream concepts targeted during "zero-shot" evaluation. In this work, we ask: How is the performance of multimodal models on downstream concepts influenced by the frequency of these concepts in their pretraining datasets? We comprehensively investigate this question across 34 models and five standard pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M, LAION-Aesthetics), generating over 300GB of data artifacts. We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample-inefficient log-linear scaling trend. This trend persists even when controlling for sample-level similarity between pretraining and downstream datasets, and when testing on purely synthetic data distributions. Furthermore, upon benchmarking models on long-tailed data sampled based on our analysis, we demonstrate that multimodal models across the board perform poorly. We contribute this long-tail test set as the "Let it Wag!" benchmark to further research in this direction. Taken together, our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
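
To make the log-linear trend concrete, the following minimal sketch fits a line to per-concept performance against the base-10 logarithm of pretraining frequency and reports the Pearson correlation. The function name `fit_log_linear` and all numbers are hypothetical and chosen for illustration only; they are not results from the paper, and the paper's own fitting procedure may differ.

```python
import math


def fit_log_linear(frequencies, scores):
    """Least-squares fit of score ~ slope * log10(frequency) + intercept.

    Returns (slope, intercept, pearson_r). Hypothetical helper for illustration.
    """
    xs = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r


# Made-up per-concept data: frequency grows 10x per step, accuracy grows by a constant.
freqs = [10, 100, 1_000, 10_000, 100_000]
accs = [0.10, 0.25, 0.40, 0.55, 0.70]

slope, intercept, r = fit_log_linear(freqs, accs)
print(f"gain per 10x more data: {slope:.2f}, Pearson r: {r:.2f}")
```

Under such a fit, the slope is the accuracy gained per tenfold increase in concept frequency, so each further constant gain costs ten times as many pretraining samples as the previous one. This is the sense in which linear improvement requires exponentially more data.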

Paper Structure

This paper contains 42 sections, 31 figures, 13 tables, 1 algorithm.

Figures (31)

  • Figure 1: Concept Extraction and Frequency Estimation. (left) We compile $4,029$ concepts from $27$ evaluation datasets. (right) We construct efficient indices for text-search (unigram indexing (1)) and image-search (RAM++ (2)); intersecting hits from both gives (3) image-text matched frequencies.
  • Figure 2: Log-linear relationships between concept frequency and CLIP zero-shot performance. Across all tested architectures (RN50, RN101, ViT-B-32, ViT-B-16, ViT-L-14) and pretraining datasets (CC-3M, CC-12M, YFCC-15M, LAION-400M), we observe a consistent linear relationship between CLIP's zero-shot performance on a concept and the log-scaled pretraining concept frequency. This trend holds for both zero-shot classification (results averaged across 17 datasets) and image-text retrieval (results averaged across 2 datasets). ** indicates that the result is significant ($p<0.05$ with a two-tailed t-test (Student, 1908)), and thus we show Pearson correlation ($\rho$) (Lee Rodgers and Nicewander, 1988) as well.
  • Figure 3: Log-linear relationships between concept frequency and T2I aesthetic scores. Across all tested T2I models pretrained on LAION-Aesthetics, we observe a consistent linear relationship between aesthetic score (averaged across 8 datasets) on a concept and the log-scaled concept frequency.
  • Figure 4: Stress-testing the log-linear scaling trends. We provide further evidence for the log-linear relationship between performance and concept frequency, across different scenarios: (left) we control for "similarity" between downstream test sets and pretraining datasets, and (right) we conduct experiments on an entirely synthetic pretraining distribution with no real-world images or captions.
  • Figure 5: Concept distribution of pretraining datasets is highly long-tailed. We showcase the distribution of pretraining frequencies of all concepts aggregated across all 17 of our downstream classification datasets. Across all the pretraining datasets, we observe very heavy tails. We normalize the concept frequencies and remove concepts with 0 counts for improved readability of the plots.
  • ...and 26 more figures
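
The frequency estimation outlined in the Figure 1 caption (text search, image search, then intersection) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: the image tags are supplied by hand as a stand-in for the output of an image tagger such as RAM++, tokenization is plain whitespace splitting, and the helper names (`build_unigram_index`, `text_hits`, `matched_frequency`) are hypothetical.

```python
from collections import defaultdict


def build_unigram_index(captions):
    """Map each lowercase unigram to the set of sample ids whose caption contains it."""
    index = defaultdict(set)
    for sample_id, caption in enumerate(captions):
        for token in caption.lower().split():
            index[token].add(sample_id)
    return index


def text_hits(index, concept):
    """Samples whose caption contains every unigram of the concept (an assumption
    about how multi-word concepts are handled)."""
    token_sets = [index.get(token, set()) for token in concept.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


def matched_frequency(concept, text_index, image_tags):
    """Count samples where the concept is found in both the caption and the image tags."""
    in_text = text_hits(text_index, concept)
    in_image = {i for i, tags in enumerate(image_tags) if concept.lower() in tags}
    return len(in_text & in_image)


# Toy pretraining set: captions plus hand-written image tags (stand-in for a tagger).
captions = ["a dog on the beach", "my cat sleeping", "dog and cat playing", "sunset over hills"]
image_tags = [{"dog", "beach"}, {"cat"}, {"cat"}, {"dog"}]

text_index = build_unigram_index(captions)
for concept in ("dog", "cat"):
    print(concept, matched_frequency(concept, text_index, image_tags))
```

In this toy data, "dog" appears in two captions and two images but only one sample has it in both, which illustrates why the image-text matched frequency can be lower than either single-modality count.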