
No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit the compositionality performance of V&L models: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the information necessary to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept-centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention pooling to obtain concept-centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
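
As an illustration of the first remedy, the sketch below shows one way short concept-centric caption parts could be extracted; the abstract only specifies "standard NLP software", so the use of spaCy noun chunks and the helper name extract_concepts are assumptions made here for illustration.

```python
# Minimal sketch: split a long caption into short, concept-centric parts.
# Assumption: spaCy noun chunks stand in for the unspecified "standard NLP
# software"; requires `pip install spacy` and the en_core_web_sm model.
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_concepts(caption: str) -> list[str]:
    """Return short noun-phrase 'concept' strings from a caption."""
    doc = nlp(caption)
    # Each noun chunk (e.g. "a red sweater") becomes one concept-centric
    # caption part to be aligned with the image.
    return [chunk.text for chunk in doc.noun_chunks]

print(extract_concepts("A man in a red sweater sits at a green tablecloth with beer cans."))
# Typically yields something like:
# ['A man', 'a red sweater', 'a green tablecloth', 'beer cans']
```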

Paper Structure

This paper contains 17 sections, 6 equations, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Method overview. (a) SigLIP uses a learnable query token in combination with an attention layer to pool the visual tokens into a single token. Aligning only global representations hampers the learning of a compositional representation. (b) Like SigLIP, our method aligns the global representations $v$ and $t$. To simplify learning of a compositional representation, our method extends SigLIP by first pooling the text encoder output tokens into concept embeddings $\{c_k\}$, which are used to attention-pool concept-specific information from the visual tokens $\bar{V}$, resulting in $\hat{v}(c_k)$ (a code sketch of this pooling step follows the figure list). Each $\hat{v}(c_k)$ and the corresponding $c_k$ are aligned using $\mathcal{L}_{xac}$. Furthermore, the global visual representation $v$ is aligned with all $c_k$ via $\mathcal{L}_{npc}$, a multi-positive variant of the SigLIP loss. Similarly, the global image and text representations $v$ and $t$ are aligned via $\mathcal{L}_{contrastive}$.
  • Figure 2: Change in attention when using C2LIP compared to SigLIP. We visualize the difference in attention to the visual tokens between C2LIP and SigLIP, given a caption and an image. Higher attention for C2LIP is shown in green, lower attention in violet; white means no change. (a) Black regions that are not part of the sweater receive less attention, while the sweater receives more or unchanged attention. (b) The background, the cups, and the beer cans receive less attention after training with our method, while attention to the green tablecloth increases or stays the same.
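
To make the pooling step from Figure 1 concrete, the sketch below shows one plausible form of the parameter-free cross-modal attention pooling, with each concept embedding $c_k$ acting as a query over the visual tokens $\bar{V}$; the dot-product formulation, the $1/\sqrt{d}$ scaling, and the function name cross_modal_attention_pool are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal PyTorch sketch of parameter-free cross-modal attention pooling.
# Assumption: dot-product attention with 1/sqrt(d) scaling; the paper's exact
# normalization may differ.
import torch

def cross_modal_attention_pool(visual_tokens: torch.Tensor,
                               concepts: torch.Tensor) -> torch.Tensor:
    """visual_tokens: (N, d) image-encoder output tokens (V_bar).
    concepts:        (K, d) concept embeddings {c_k} pooled from the text encoder.
    Returns          (K, d) concept-specific visual embeddings v_hat(c_k)."""
    d = visual_tokens.shape[-1]
    logits = concepts @ visual_tokens.T / d ** 0.5   # (K, N) similarity scores
    attn = logits.softmax(dim=-1)                    # attention over visual tokens
    # No learnable parameters: each pooled embedding is an attention-weighted
    # average of the visual tokens.
    return attn @ visual_tokens                      # (K, d)

v_bar = torch.randn(196, 768)   # e.g. ViT patch tokens of one image
c_k = torch.randn(5, 768)       # 5 concept embeddings from one caption
print(cross_modal_attention_pool(v_bar, c_k).shape)  # torch.Size([5, 768])
```

Each pooled embedding $\hat{v}(c_k)$ would then be contrasted with its corresponding concept embedding $c_k$ via $\mathcal{L}_{xac}$, as described in the Figure 1 caption.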