Language Plays a Pivotal Role in the Object-Attribute Compositional Generalization of CLIP

Reza Abbasi, Mohammad Samiei, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR

This work found that CLIPs trained with large datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude improvement in effective compositional OoD generalization compared to both supervised models and CLIPs trained with smaller datasets, such as CC-12M and YFCC-15M.

Abstract

Vision-language models, such as CLIP, have shown promising Out-of-Distribution (OoD) generalization under various types of distribution shifts. Recent studies have attempted to identify the leading cause of this capability. In this work, we follow the same path, but focus on a specific type of OoD data - images with novel compositions of attribute-object pairs - and study whether such models can successfully classify those images into composition classes. We carefully designed an authentic image test dataset called ImageNet-AO, consisting of attributes for objects that are unlikely to be encountered in the CLIP training sets. We found that CLIPs trained with large datasets such as OpenAI CLIP, LAION-400M, and LAION-2B show orders-of-magnitude improvement in effective compositional OoD generalization compared to both supervised models and CLIPs trained with smaller datasets, such as CC-12M and YFCC-15M. Our results provide evidence that the scale and diversity of training data and language supervision play a key role in unlocking the compositional generalization abilities of vision-language models.

Paper Structure

This paper contains 17 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: a) Comparing effective OoD generalization of CLIP models with diverse backbones and training sets in a zero-shot setting, where no fine-tuning is performed on the target task. The in-distribution (ID) test set is the ImageNet validation split, with the labels being the object names, while the out-of-distribution test set is our designed compositional dataset, with labels being adjective-object pairs. Notably, there is a large gap between the performance of CLIPs trained on small datasets, e.g. CC-12M and YFCC-15M, and that of CLIPs trained on much larger datasets such as LAION and OpenAI's. b) Comparing OoD generalization of the models trained with a supervised loss vs. CLIPs. ID and OoD test sets are the same as before, but the labels are the object names in both the ID and OoD test sets, as the adjectives are not among the labels of the pre-trained supervised models. Despite being competitive on ID accuracy, the supervised models fall short of the OoD accuracy of the CLIP models.
  • Figure 2: Examples of images from our generated dataset. The dataset is created by combining adjectives and nouns into pairs that do not appear in the CLIP training sets, and is designed specifically for benchmarking OoD generalization.
  • Figure 3: Normalized Mutual Information between the attributes and objects, calculated for various CLIP training sets. The domains of these random variables are defined based on the compositions present in our generated dataset.
  • Figure 4: Evaluation of the OoD generalization of different CLIP models trained using various datasets. The models were tested on both in-distribution and out-of-distribution test sets.
  • Figure 5: Evaluation of the CLIP models on the subset of our dataset that consists of ImageNet objects.
  • ...and 2 more figures
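
The Figure 3 caption refers to the normalized mutual information (NMI) between attributes and objects in a training set. As a rough illustration of that quantity, the following is a minimal sketch that computes NMI from a list of (attribute, object) pairs. The function name, the toy pairs, and the choice of normalizing by the geometric mean of the two entropies are all assumptions made here for illustration; the summary above does not say which normalization the paper uses or how pairs are extracted from captions.

```python
import math
from collections import Counter


def normalized_mutual_information(pairs):
    """NMI between attribute and object over (attribute, object) pairs.

    Assumption: MI is normalized by sqrt(H(attribute) * H(object));
    other normalizations (min, max, or mean of the entropies) also exist.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("pairs must be non-empty")

    joint = Counter(pairs)                  # counts of each (attribute, object)
    attr = Counter(a for a, _ in pairs)     # marginal counts of attributes
    obj = Counter(o for _, o in pairs)      # marginal counts of objects

    def entropy(counts):
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # I(A; O) = sum over pairs of p(a, o) * log(p(a, o) / (p(a) * p(o)))
    mi = sum(
        (c / n) * math.log((c * n) / (attr[a] * obj[o]))
        for (a, o), c in joint.items()
    )

    h_a, h_o = entropy(attr), entropy(obj)
    if h_a == 0.0 or h_o == 0.0:
        return 0.0  # one variable is constant, so there is nothing to share
    return mi / math.sqrt(h_a * h_o)


# Toy data (hypothetical): every attribute occurs with every object.
independent = [("red", "car"), ("red", "dog"), ("blue", "car"), ("blue", "dog")]
# Toy data (hypothetical): each attribute occurs with exactly one object.
dependent = [("red", "car"), ("blue", "dog")]

nmi_independent = normalized_mutual_information(independent)
nmi_dependent = normalized_mutual_information(dependent)
```

Under this definition, a value near 0 means attributes and objects co-occur almost independently, while a value near 1 means each attribute is tied to particular objects.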