Deciphering the Role of Representation Disentanglement: Investigating Compositional Generalization in CLIP Models

Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

TL;DR

This work tackles compositional OoD generalization in CLIP models by introducing ImageNet-AO, a benchmark of unseen attribute–object combinations designed to be novel relative to CLIP training data. It shows that CLIP models pretrained on large, well-curated datasets exhibit stronger C-OoD performance, and that strong text and image representation disentanglement—especially in the text space and its transfer to the image space via contrastive learning—correlates with improved C-OoD generalization. The authors formalize and measure disentanglement using multiple metrics (including Z-diff, Explicitness, and completeness) and demonstrate that intrinsic dimensionality of composition representations declines as C-OoD accuracy rises, indicating more compact, decomposed representations. Through text–image retrieval experiments and analysis on Shapes3D/Sprites, they argue that decomposable, disentangled representations enable more robust compositional reasoning, offering a principled direction for boosting OoD generalization in vision–language models and informing dataset curation strategies.

Abstract

CLIP models have recently been shown to exhibit out-of-distribution (OoD) generalization capabilities. However, compositional out-of-distribution (C-OoD) generalization, a crucial aspect of a model's ability to understand unseen compositions of known concepts, remains relatively unexplored for CLIP models. Our goal is to address this problem and identify the factors that contribute to C-OoD generalization in CLIP models. We note that previous studies of the compositional understanding of CLIP models frequently fail to ensure that test samples are genuinely novel relative to the CLIP training data. To this end, we carefully synthesized a large and diverse dataset in the single-object setting, pairing objects with attributes in combinations that are highly unlikely to be encountered in the combined training datasets of various CLIP models. This dataset enables an authentic evaluation of C-OoD generalization. Our observations reveal varying levels of C-OoD generalization across different CLIP models. We propose that the disentanglement of CLIP representations serves as a critical indicator in this context. Using our synthesized datasets and other existing datasets, we assess various disentanglement metrics of text and image representations. Our study reveals that the disentanglement of image and text representations, particularly with respect to their compositional elements, plays a crucial role in improving the generalization of CLIP models in out-of-distribution settings. This finding suggests promising opportunities for advancing out-of-distribution generalization in CLIP models.
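
The zero-shot evaluation described in the abstract can be sketched as nearest-text-embedding classification over attribute-object pair labels. This is only a minimal illustration, not the authors' evaluation code: the three-dimensional embeddings, the label pairs, and the helper names (`zero_shot_predict`, `accuracy`) are all hypothetical, and in practice the vectors would come from a CLIP image encoder and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(image_emb, label_embs):
    """Return the label whose text embedding is most similar to the image embedding."""
    return max(label_embs, key=lambda label: cosine(image_emb, label_embs[label]))


def accuracy(image_embs, true_labels, label_embs):
    """Fraction of images whose predicted (attribute, object) pair is correct."""
    correct = sum(
        zero_shot_predict(emb, label_embs) == label
        for emb, label in zip(image_embs, true_labels)
    )
    return correct / len(true_labels)


# Hypothetical toy text embeddings for three attribute-object pairs.
label_embs = {
    ("blue", "banana"): [1.0, 0.1, 0.0],
    ("furry", "teapot"): [0.0, 1.0, 0.2],
    ("glass", "zebra"): [0.1, 0.0, 1.0],
}

# Hypothetical toy image embeddings; the third one sits closer to the wrong label.
image_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.7, 0.1, 0.6]]
true_labels = [("blue", "banana"), ("furry", "teapot"), ("glass", "zebra")]

c_ood_accuracy = accuracy(image_embs, true_labels, label_embs)
print(f"toy C-OoD accuracy: {c_ood_accuracy:.3f}")
```

Scoring against the full pair label, rather than the object name alone, is what separates the compositional setting from the in-distribution one: a prediction counts as correct only if both the attribute and the object match.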

Paper Structure (46 sections, 4 equations, 13 figures, 9 tables)

This paper contains 46 sections, 4 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Comparing zero-shot compositional out-of-distribution (C-OoD) generalization across diverse CLIP models and training sets. In-distribution (ID) performance is evaluated on the ImageNet validation set with object-name labels, while C-OoD generalization is assessed on our designed compositional dataset using attribute-object pair labels. Notably, CLIP models trained on the Common Pool dataset exhibit a steeper accuracy slope when moving from the ID to the compositional OoD setting than models trained on other datasets such as WebLI. CLIP models trained on the LAION and DataComp datasets also show significantly higher C-OoD accuracy across the range of ID accuracy. Despite improved in-distribution accuracy, models pretrained on WebLI do not demonstrate substantial gains in generalizing to the novel compositional out-of-distribution test cases.
  • Figure 2: Examples of images from our generated dataset. This dataset is created by combining attributes and objects that do not appear in the CLIP training sets, specifically designed for benchmarking compositional OoD generalization purposes.
  • Figure 3: Dataset design stages. The design process consists of a generation phase, which builds the initial dataset from the full set of object and attribute compositions, followed by three distinct filtration steps. In the first filtration step, images in which the target attribute or object is not clearly visible are eliminated. In the second filtration step, images whose captions are already present in public datasets curated for CLIP training are removed. In the third filtration step, a Faiss k-nearest-neighbors search is used to identify and filter out similar images.
  • Figure 4: Top: Representation disentanglement is correlated between the text and image embeddings of CLIP models. Bottom: Disentanglement metrics vs. C-OoD accuracy.
  • Figure 5: The decrease in the soft rank of attribute-object representations relative to the embedding size correlates with improved C-OoD accuracy. This indicates that decomposing the representations of attributes and objects yields a low-dimensional representation in CLIP models that exhibit robust C-OoD performance, highlighting the representation disentanglement of CLIP models with strong C-OoD generalization.
  • ...and 8 more figures
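
The idea behind Figure 5, a dimensionality score for a set of embeddings divided by the embedding size, can be illustrated with a small sketch. The paper's exact soft-rank definition is not given in this section, so the code below swaps in the participation ratio of the covariance spectrum, (sum of eigenvalues)^2 / (sum of squared eigenvalues), as a stand-in. The function names and toy data are hypothetical.

```python
def participation_ratio(rows):
    """Soft dimensionality of a point cloud: tr(C)^2 / tr(C^2) for covariance C.

    Equals 1 when all variance lies along one direction and d when the
    variance is spread evenly over all d dimensions.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix C = X^T X / n, computed entry by entry.
    cov = [
        [sum(c[i] * c[j] for c in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # For a symmetric matrix, tr(C^2) is the sum of its squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    if trace_sq == 0:
        raise ValueError("embeddings have zero variance")
    return trace ** 2 / trace_sq


def relative_soft_rank(rows):
    """Dimensionality score divided by the embedding size (between 0 and 1)."""
    return participation_ratio(rows) / len(rows[0])


# Toy embeddings that vary along a single direction in 3-d space.
line = [[float(t), 2.0 * t, 3.0 * t] for t in (-2, -1, 0, 1, 2)]

# Toy embeddings that spread their variance evenly over all three axes.
spread = [
    [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0], [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
]

print(relative_soft_rank(line))
print(relative_soft_rank(spread))
```

Under this stand-in measure, the first toy set scores about one third of the embedding size and the second scores the full embedding size, so a lower relative score corresponds to a more compact representation in the sense the caption describes.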