Understanding Hardness of Vision-Language Compositionality from A Token-level Causal Lens

Ziliang Chen, Tianang Xiao, Jusheng Zhang, Yongsen Zheng, Xipeng Chen

TL;DR

The paper tackles CLIP's persistent failures in vision-language compositionality by introducing a token-level causal representation learning (CRL) framework built on a language-token SCM. It extends block identifiability to tokenized text, proving that the modal-invariant latent $z_{\rm inv}$ can be recovered under both sentence- and token-level data generation, while also revealing composition nonidentifiability via pseudo-optimal encoders $g^{**}$ that align yet ignore token-level swaps, replacements, and additions. The authors connect language-side nonidentifiability to visual modality gaps and show how iterated composition operators can produce increasingly hard negatives, motivating advanced negative mining strategies. Empirically, they demonstrate that token-level perturbations reproduce many benchmark hard negatives and improve CLIP-based models when used in training, thereby bridging theory to practice and offering practical routes to bolster compositional generalization in multimodal models.

Abstract

Contrastive Language-Image Pre-training (CLIP) delivers strong cross-modal generalization by aligning images and texts in a shared embedding space, yet it persistently fails at compositional reasoning over objects, attributes, and relations, often behaving like a bag-of-words matcher. Prior causal accounts typically model text as a single vector, obscuring token-level structure and leaving core phenomena, such as prompt sensitivity and failures on hard negatives, unexplained. We address this gap with a token-aware causal representation learning (CRL) framework grounded in a sequential, language-token structural causal model (SCM). Our theory extends block identifiability to tokenized text, proving that CLIP's contrastive objective can recover the modal-invariant latent variable under both sentence-level and token-level SCMs. Crucially, token granularity yields the first principled explanation of CLIP's compositional brittleness: composition nonidentifiability. We show the existence of pseudo-optimal text encoders that achieve perfect modal-invariant alignment yet are provably insensitive to SWAP, REPLACE, and ADD operations over atomic concepts, thereby failing to distinguish correct captions from hard negatives despite optimizing the same training objective as true-optimal encoders. The analysis further links language-side nonidentifiability to visual-side failures via the modality gap and shows how iterated composition operators compound hardness, motivating improved negative mining strategies.
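The "bag-of-words matcher" behaviour mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's pseudo-optimal encoder construction; it is a hypothetical order-invariant text encoder (`bag_of_words_encode`, built on made-up hash-based token vectors from `token_vector`) that shows only the SWAP case: any encoder that pools tokens without regard to order assigns the same embedding to a caption and to its word-swapped hard negative.

```python
import hashlib
import math
from typing import List


def token_vector(token: str, dim: int = 8) -> List[float]:
    """Deterministic pseudo-embedding for a token.

    Hypothetical stand-in for a learned token embedding: the first `dim`
    bytes of a SHA-256 digest, rescaled to the range [-0.5, 0.5].
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [(b / 255.0) - 0.5 for b in digest[:dim]]


def bag_of_words_encode(caption: str, dim: int = 8) -> List[float]:
    """Mean-pool token vectors, discarding word order entirely."""
    tokens = caption.lower().split()
    total = [0.0] * dim
    # Sorting fixes the floating-point summation order, so two captions with
    # the same multiset of tokens produce bit-identical embeddings.
    for tok in sorted(tokens):
        vec = token_vector(tok, dim)
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(tokens) for t in total]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


caption = "the horse is eating the grass"
swap_negative = "the grass is eating the horse"    # same tokens, different meaning
replace_negative = "the horse is eating the hay"   # one token replaced

emb_caption = bag_of_words_encode(caption)
emb_swap = bag_of_words_encode(swap_negative)
emb_replace = bag_of_words_encode(replace_negative)

# The swapped caption is indistinguishable from the original for this encoder.
print("identical under SWAP:", emb_caption == emb_swap)
print("similarity to REPLACE negative:", cosine(emb_caption, emb_replace))
```

This toy encoder still reacts to REPLACE and ADD because those edits change the token multiset. The abstract's claim is stronger: pseudo-optimal encoders exist that are insensitive to all three operations while still achieving perfect modal-invariant alignment.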

Paper Structure

This paper contains 26 sections, 11 theorems, 92 equations, 4 figures, 2 tables, 1 algorithm.

Key Result

Theorem 2

(Block-Identified Modal-invariant Alignment (Token-agnostic)) Consider the image-text pair generated by Assumption \ref{ass:multimodal data}. If their densities and mappings satisfy: 1). $\mathbf{f}$, $\mathbf{g}$ (note that the output of $\mathbf{g}$ is taken to lie in a continuous space) … where $H(\cdot)$ denotes the differential entropy of the random variables …

Figures (4)

  • Figure 1: Latent-variable SCMs that represent the multimodal image-text data generation processes from (a) the sentence-level aspect (Assumption \ref{ass:multimodal data}) and (b) the token-level aspect (Assumption \ref{ass:multimodal data2}). Causal representation learning seeks the unsupervised recovery of the modal-shared latent variable $\boldsymbol{z}_{\sf inv}$ by CLIP, which is rigorously justified in Theorems \ref{thm:theorem1} and \ref{thm:theorem2}.
  • Figure 2: Comparison between (a) existing multimodal CRL theory (daunhawer2022identifiability) and (b) our CRL theory (Theorem \ref{thm:theorem2} and Corollary \ref{cor:corollary6}). Our framework allows CLIP to be analyzed at word-and-phrase granularity, which lets us explain theoretically CLIP's weakness in compositional understanding (Section 4).
  • Figure 3: CLIP's accuracy (ACC) on the negative samples generated by ARO and by our Algorithm 1. The overlap percentage indicates the share of ARO negative samples that fall under the cases in Theorems \ref{thm:theorem3}-\ref{thm:theorem5}.
  • Figure 4: CLIP's accuracy (ACC) on the negative samples generated by VALSE and by our Algorithm 1. The percentage indicates the share of VALSE negative samples that fall under the cases in Theorems \ref{thm:theorem3}-\ref{thm:theorem5}.
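
The negatives in Figures 3 and 4 come from token-level edits of a correct caption. The paper's Algorithm 1 is not reproduced on this page, so the following is only a minimal sketch of what SWAP, REPLACE, and ADD could look like on a token list, and of how chaining them yields negatives that drift further from the original caption. All function names (`swap`, `replace`, `add`, `compose`) and the example caption are hypothetical illustrations, not the authors' implementation.

```python
from typing import Callable, List

Tokens = List[str]
Edit = Callable[[Tokens], Tokens]


def swap(i: int, j: int) -> Edit:
    """SWAP: exchange the tokens at positions i and j."""
    def apply(tokens: Tokens) -> Tokens:
        out = list(tokens)
        out[i], out[j] = out[j], out[i]
        return out
    return apply


def replace(i: int, new_token: str) -> Edit:
    """REPLACE: substitute the token at position i with new_token."""
    def apply(tokens: Tokens) -> Tokens:
        out = list(tokens)
        out[i] = new_token
        return out
    return apply


def add(i: int, new_token: str) -> Edit:
    """ADD: insert new_token before position i."""
    def apply(tokens: Tokens) -> Tokens:
        out = list(tokens)
        out.insert(i, new_token)
        return out
    return apply


def compose(*edits: Edit) -> Edit:
    """Apply several edits in sequence (an iterated composition)."""
    def apply(tokens: Tokens) -> Tokens:
        out = list(tokens)
        for edit in edits:
            out = edit(out)
        return out
    return apply


caption = "a black dog chases a white cat".split()

print(" ".join(swap(1, 5)(caption)))          # attributes exchanged
print(" ".join(replace(2, "horse")(caption)))  # object replaced
print(" ".join(add(6, "small")(caption)))      # attribute added
print(" ".join(compose(swap(1, 5), replace(2, "horse"))(caption)))
```

Each operator returns a new list and leaves the input caption untouched, so several negatives can be derived from the same positive caption.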

Theorems & Definitions (14)

  • Theorem 2
  • Corollary 3
  • Theorem 5
  • Corollary 6
  • Theorem 7
  • Theorem 8
  • Theorem 9
  • Lemma 10
  • Proposition 11
  • proof
  • ...and 4 more