Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks

Thaddäus Wiedemer, Yash Sharma, Ameya Prabhu, Matthias Bethge, Wieland Brendel

TL;DR

This work investigates when CLIP-like vision-language models can generalize compositionally on real-world data and whether this ability can be predicted from pretraining frequencies of object concepts. By curating compositional test sets from standard retrieval benchmarks and linking per-sample performance to the independent pretraining frequencies of constituent objects, the authors show a near-linear relationship between the geometric mean of object frequencies and retrieval success, with saturation for strong models. They demonstrate that CLIP can disentangle and recompose objects even for novel combinations, and that data curation strategies that balance object occurrences can improve generalization without increasing data volume. The results provide practical guidance for dataset design and offer a framework for forecasting compositional generalization as a function of pretraining data in real-world VLMs.
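The frequency-to-performance relationship described above can be illustrated with a small sketch. It is not the authors' code. The toy object counts and per-sample hits are invented, and the helper names `geometric_mean` and `fit_line` are hypothetical. Regressing on the base-10 logarithm of the geometric mean is also an assumption, since the summary does not state the axis scaling.

```python
import math

def geometric_mean(freqs):
    """Geometric mean of positive object counts, computed in log space."""
    if not freqs or any(f <= 0 for f in freqs):
        raise ValueError("frequencies must be non-empty and positive")
    return math.exp(sum(math.log(f) for f in freqs) / len(freqs))

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical toy data: (pretraining counts of the objects in a sample,
# 1.0 if the correct item was retrieved within the top k, else 0.0).
samples = [
    ([10, 1_000], 0.0),
    ([100, 10_000], 0.0),
    ([1_000, 100_000], 1.0),
    ([10_000, 1_000_000], 1.0),
]

# Assumption: the predictor is log10 of the geometric mean frequency.
xs = [math.log10(geometric_mean(freqs)) for freqs, _ in samples]
ys = [hit for _, hit in samples]
slope, intercept = fit_line(xs, ys)
print(f"recall ~ {slope:.3f} * log10(gm_freq) + {intercept:.3f}")
```

In practice the hits would be averaged within frequency bins before fitting, and a saturating model would flatten at the high-frequency end rather than follow the line.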

Abstract

We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observations to be mapped to common concepts. To explore CLIP's compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP's performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly. Additionally, we are the first to show how this ability scales with pretraining data. For data curation in practice, our results suggest that balancing object occurrences improves generalization, which should benefit CLIP's efficiency and accuracy without scaling data volume.
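The filtering step in the abstract can be sketched as follows, under stated assumptions. Each sample is taken to be already annotated with a list of object names. A combination is treated as "known" if all of its objects co-occur in at least one pretraining sample, a definition chosen here for illustration that may differ from the paper's exact criterion. The function names and toy data are hypothetical.

```python
from collections import Counter

def index_pretraining(pretrain_objects):
    """Return per-object sample counts and the object set of each pretraining sample."""
    freq = Counter()
    object_sets = []
    for objs in pretrain_objects:
        unique = frozenset(objs)
        freq.update(unique)          # count each object once per sample
        object_sets.append(unique)
    return freq, object_sets

def is_unknown_combination(sample_objects, pretrain_sets):
    """True if no pretraining sample contains all objects of the test sample.

    Assumption: 'known' means the full object set co-occurs somewhere in
    pretraining. This is a linear scan, fine for a sketch but too slow for
    a web-scale corpus, which would need an inverted index.
    """
    combo = frozenset(sample_objects)
    return not any(combo <= s for s in pretrain_sets)

# Hypothetical toy annotations.
pretrain = [["dog", "ball"], ["cat", "sofa"], ["dog", "sofa"]]
test_set = [["dog", "ball"], ["cat", "ball"]]

freq, pretrain_sets = index_pretraining(pretrain)
unknown = [s for s in test_set if is_unknown_combination(s, pretrain_sets)]
print(unknown)       # test samples whose object combination never co-occurs
print(freq["dog"])   # per-object pretraining frequency
```

The per-object counts in `freq` are the quantities that would then be aggregated per sample and related to retrieval performance.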

Paper Structure

This paper contains 22 sections, 3 equations, 6 figures.

Figures (6)

  • Figure 1: T2I Recall@10. CLIP's performance on unknown combinations (bottom) matches that on known combinations (top) and can be consistently predicted as a linear function of the average pretraining frequency of the constituent objects. All regression fits are significant at $p<0.01$.
  • Figure 2: I2T Recall@10. CLIP's performance on unknown combinations (bottom) almost matches that on known combinations (top) and can be consistently predicted as a linear function of the average pretraining frequency of the constituent objects. All regression fits are significant at $p<0.01$.
  • Figure 3: T2I Recall@5. On both known and unknown combinations, across architectures and pretraining sets, the sample frequency (the aggregated frequencies of the objects in the combination) is predictive of performance.
  • Figure 4: I2T Recall@5. On both known and unknown combinations, across architectures and pretraining sets, the sample frequency (the aggregated frequencies of the objects in the combination) is predictive of performance.
  • Figure 5: T2I Recall@1. On both known and unknown combinations, across architectures and pretraining sets, the sample frequency (the aggregated frequencies of the objects in the combination) is predictive of performance.
  • ...and 1 more figure