Does CLIP Bind Concepts? Probing Compositionality in Large Image Models

Martha Lewis, Nihal V. Nayak, Peilin Yu, Qinan Yu, Jack Merullo, Stephen H. Bach, Ellie Pavlick

TL;DR

This work interrogates whether large vision-language models like CLIP encode grounded compositionality, especially in tasks requiring binding of concepts to syntactic roles. It introduces a controlled concept-binding benchmark with single-object, two-object, and relational datasets and compares CLIP variants to compositional distributional semantics models (CDSMs) and Compositional Soft Prompting. The results show strong performance of CLIP on simple adjective-noun composition but clear limitations in binding-based tasks, with CDSMs offering inconsistent gains and often failing to generalize beyond single-object settings. Overall, the findings highlight substantial gaps in current pretraining for binding and order in grounded visual reasoning and motivate binding-aware pretraining and evaluation in future work.

Abstract

Large-scale neural network models combining text and images have made incredible progress in recent years. However, it remains an open question to what extent such models encode compositional representations of the concepts over which they operate, such as correctly identifying "red cube" by reasoning over the constituents "red" and "cube". In this work, we focus on the ability of a large pretrained vision and language model (CLIP) to encode compositional concepts and to bind variables in a structure-sensitive way (e.g., differentiating "cube behind sphere" from "sphere behind cube"). To inspect the performance of CLIP, we compare several architectures from research on compositional distributional semantics models (CDSMs), a line of research that attempts to implement traditional compositional linguistic structures within embedding spaces. We benchmark them on three synthetic datasets - single-object, two-object, and relational - designed to test concept binding. We find that CLIP can compose concepts in a single-object setting, but in situations where concept binding is needed, performance drops dramatically. At the same time, CDSMs also perform poorly, with best performance at chance level.
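The binding example in the abstract can be made concrete with a small sketch. It assumes a distractor label is formed by swapping the two arguments of the relation, as in "cube behind sphere" versus "sphere behind cube"; the helper name `swap_relation` is hypothetical and is not taken from the paper.

```python
def swap_relation(label: str, relation: str) -> str:
    """Return `label` with the two arguments of `relation` swapped.

    Hypothetical helper for illustration; assumes the label has the
    form "<subject> <relation> <object>".
    """
    subject, obj = label.split(f" {relation} ")
    return f"{obj} {relation} {subject}"


true_label = "cube behind sphere"
distractor = swap_relation(true_label, "behind")
print(distractor)

# Both labels contain exactly the same words, so a model that ignores
# word order and role binding has no basis for telling them apart.
same_words = sorted(true_label.split()) == sorted(distractor.split())
print(same_words)
```

Because the two labels differ only in which object fills which role, choosing between them requires binding each concept to its position in the relation rather than merely recognizing the concepts.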

Paper Structure

This paper contains 39 sections, 4 equations, 1 figure, 10 tables, 1 algorithm.

Figures (1)

  • Figure 1: Example images and label sets from each dataset. Text in green marks the true classes; text in red marks the distractors. Unlike the two-object and relational datasets, the single-object dataset does not require concept binding.