Compositional Entailment Learning for Hyperbolic Vision-Language Models

Avik Pal, Max van Spengler, Guido Maria D'Amely di Melendugno, Alessandro Flaborea, Fabio Galasso, Pascal Mettes

TL;DR

This work tackles the limitation of Euclidean vision-language representations in capturing hierarchical scene structure. It introduces HyCoCLIP, a hyperbolic vision-language model that jointly reasons over whole images, image boxes, and their textual box descriptions using compositional entailment learning. By combining hierarchical contrastive and entailment losses in the Lorentz model, HyCoCLIP achieves stronger zero-shot classification and hierarchical classification performance, while remaining competitive on retrieval and object-detection tasks. The approach yields a more interpretable, hierarchically organized embedding space, though it relies on generated bounding-box groundings, which increases data processing during training but preserves inference efficiency.

Abstract

Image-text representation learning forms a cornerstone in vision-language models, where pairs of images and textual descriptions are contrastively aligned in a shared embedding space. Since visual and textual concepts are naturally hierarchical, recent work has shown that hyperbolic space can serve as a high-potential manifold to learn vision-language representation with strong downstream performance. In this work, for the first time we show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs. We propose Compositional Entailment Learning for hyperbolic vision-language models. The idea is that an image is not only described by a sentence but is itself a composition of multiple object boxes, each with their own textual description. Such information can be obtained freely by extracting nouns from sentences and using openly available localized grounding models. We show how to hierarchically organize images, image boxes, and their textual descriptions through contrastive and entailment-based objectives. Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning, as well as recent hyperbolic alternatives, with better zero-shot and retrieval generalization and clearly stronger hierarchical performance.

Paper Structure

This paper contains 45 sections, 16 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Compositional Entailment Learning for hyperbolic vision-language models. (a) The same object appears in different vision-language contexts. (b) Visual-semantic ordering: $I$ (whole image) and $T$ (full caption) provide context to the more general $I^{box}$ (image local box) and $T^{box}$ (text local box). (c) This specific-general ordering between $(I,T), (I^{box}, T^{box}), (I, I^{box}), (T, T^{box})$ is enforced in hyperbolic space using entailment cones. The exterior angle $\phi$ of a specific concept ($T$) is pushed to lie within the aperture threshold $\eta\omega$ of the general concept ($T^{box}$).
  • Figure 2: An overview of HyCoCLIP. Text and image boxes are extracted offline from image-text datasets (sides). Next, HyCoCLIP's encoder modules embed the images and texts, projecting the representations into the hyperbolic latent space. HyCoCLIP preserves the inter-modal and intra-modal relationships by placing broader concepts close to the center and finer concepts close to the border, and by using entailment cones to give an interpretable structure to the learned latent space (cf. Fig. \ref{fig:entailment_diagram}).
  • Figure 3: Aperture threshold $\eta$ scaling the aperture $\omega$ to increase or decrease the width of the entailment cone.
  • Figure 4: Histogram of the ratio of box area to full-image area for GRIT and RedCaps. RedCaps generally yields larger crops, indicating lower precision in grounding concepts.
  • Figure 5: Visualizing the learned hyperbolic space of HyCoCLIP in lower dimensions using samples from GRIT. (a) Distribution of embedding distances from the origin: HyCoCLIP embeds text closer to the origin than images, and box samples at a smaller radius than full images/captions. (b) HoroPCA and (c) CO-SNE visualizations of the latent space in $\mathbb{L}^2$.
  • ...and 11 more figures
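
The entailment-cone mechanism described in the captions of Figures 1 and 3 can be illustrated with a small numerical sketch. It assumes the Lorentz-model expressions for the half-aperture and exterior angle commonly used in earlier hyperbolic image-text work; these are not taken from the text above. The function names, the minimum-radius constant `min_radius = 0.1`, and the fixed curvature are illustrative assumptions rather than the authors' implementation. The loss is zero whenever the exterior angle $\phi$ of the specific embedding falls inside the scaled aperture $\eta\omega$ of the general embedding.

```python
import numpy as np

def lorentz_time(x_space, curv=1.0):
    # Time component of a point on the hyperboloid <x, x>_L = -1/curv.
    return np.sqrt(1.0 / curv + np.sum(x_space ** 2, axis=-1))

def half_aperture(x_space, curv=1.0, min_radius=0.1, eps=1e-8):
    # omega: half-aperture of the cone rooted at x; narrows as x moves outward.
    # min_radius is an assumed constant, not a value stated in the text.
    norm = np.linalg.norm(x_space, axis=-1)
    ratio = 2.0 * min_radius / (np.sqrt(curv) * norm + eps)
    return np.arcsin(np.clip(ratio, -1.0 + eps, 1.0 - eps))

def exterior_angle(x_space, y_space, curv=1.0, eps=1e-8):
    # phi: angle at x between the geodesic from the origin through x
    # and the geodesic from x to y.
    x_time = lorentz_time(x_space, curv)
    y_time = lorentz_time(y_space, curv)
    inner = np.sum(x_space * y_space, axis=-1) - x_time * y_time  # Lorentzian inner product
    c_inner = curv * inner
    numer = y_time + c_inner * x_time
    denom = np.linalg.norm(x_space, axis=-1) * np.sqrt(np.clip(c_inner ** 2 - 1.0, eps, None))
    return np.arccos(np.clip(numer / (denom + eps), -1.0 + eps, 1.0 - eps))

def entailment_loss(general, specific, eta=1.0, curv=1.0):
    # Penalize the specific point only when it lies outside the scaled cone eta * omega.
    phi = exterior_angle(general, specific, curv)
    omega = half_aperture(general, curv)
    return np.maximum(0.0, phi - eta * omega)

# Hypothetical 2-D space components: a "box" embedding and two candidate "image" embeddings.
box = np.array([1.0, 0.0])
image_inside = np.array([3.0, 0.0])    # same ray, farther from the origin
image_outside = np.array([-2.0, 0.0])  # opposite side of the origin

loss_inside = float(entailment_loss(box, image_inside))
loss_outside = float(entailment_loss(box, image_outside))
```

In this sketch, raising `eta` widens the cone and can only lower the penalty, which matches the role of the aperture threshold in Figure 3.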