Hyperbolic Image-Text Representations

Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam

TL;DR

MERU introduces hyperbolic image-text representations on the Lorentz hyperboloid to capture the natural visual-semantic hierarchy. By lifting Euclidean embeddings onto $\mathcal{L}^n$ with the exponential map and coupling a Lorentzian contrastive loss with an entailment loss, MERU induces a structured, interpretable space where text lies closer to the root than images. Trained on large-scale data (~12M image-text pairs), MERU is competitive with CLIP on zero-shot image classification and retrieval, with added benefits for small embedding dimensions and on-device deployment. Qualitative analysis reveals finer-grained hierarchical organization and explicit root-based entailment, while ablations highlight the necessity of learnable curvature and the entailment objective for robust structure.
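To make the lifting step concrete, here is a minimal NumPy sketch of the exponential map at the hyperboloid origin and the Lorentzian geodesic distance. It uses the standard Lorentz-model formulas for curvature $-c$ and is not taken from the MERU codebase. The function names (`exp_map0`, `time_component`, `lorentz_distance`) and the fixed curvature argument `c` are assumptions made for illustration. Only the space components of each point are stored; the time component is recomputed from the hyperboloid constraint.

```python
import numpy as np

def exp_map0(v, c=1.0):
    """Lift Euclidean vectors v (..., n) onto the hyperboloid of curvature -c.

    Returns only the space components; the time component is implied
    by the constraint <x, x>_L = -1/c (see time_component).
    """
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.clip(sqrt_c * norm, 1e-8, None)  # avoid 0/0 at the origin
    return np.sinh(scaled) / scaled * v

def time_component(x_space, c=1.0):
    """Time component recovered from the hyperboloid constraint."""
    return np.sqrt(1.0 / c + np.sum(x_space ** 2, axis=-1))

def lorentz_inner(x_space, y_space, c=1.0):
    """Lorentzian inner product <x, y>_L = <x_space, y_space> - x_time * y_time."""
    return (np.sum(x_space * y_space, axis=-1)
            - time_component(x_space, c) * time_component(y_space, c))

def lorentz_distance(x_space, y_space, c=1.0):
    """Geodesic distance d(x, y) = arccosh(-c <x, y>_L) / sqrt(c)."""
    arg = np.clip(-c * lorentz_inner(x_space, y_space, c), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical encoder outputs: 4 image and 4 text vectors of width 8.
    img = exp_map0(0.1 * rng.normal(size=(4, 8)))
    txt = exp_map0(0.1 * rng.normal(size=(4, 8)))
    # Pairwise similarity for a contrastive loss: negative Lorentzian distance.
    sim = -lorentz_distance(img[:, None, :], txt[None, :, :])
    print(sim.shape)
```

One property worth checking in such a sketch is that the distance from the origin to `exp_map0(v, c)` equals the Euclidean norm of `v` for any `c`, since the exponential map preserves length along geodesics through the origin.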

Abstract

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru

Paper Structure (68 sections, 22 equations, 24 figures, 8 tables)

This paper contains 68 sections, 22 equations, 24 figures, 8 tables.

Figures (24)

  • Figure 1: Hyperbolic image-text representations. Left: Images and text depict concepts and can be jointly viewed in a visual-semantic hierarchy, wherein text 'exhausted doggo' is more generic than an image (which might have more details like a cat or snow). Our method MERU embeds images and text in a hyperbolic space that is well-suited to embed tree-like data. Right: Representation manifolds of CLIP (hypersphere) and MERU (hyperboloid) illustrated in 3D. MERU assumes the origin to represent the most generic concept, and embeds text closer to the origin than images.
  • Figure 2: MERU model design: MERU comprises architectural components similar to those of standard image-text contrastive models like CLIP. While CLIP projects the embeddings to a unit hypersphere, MERU lifts them onto the Lorentz hyperboloid using the exponential map. The contrastive loss uses the negative of Lorentzian distance as a similarity metric, and a special entailment loss enforces a 'text entails image' partial order in the representation space.
  • Figure 3: Entailment loss (illustrated for $\mathcal{L}^2$): This loss pushes image embedding $\mathbf{y}$ inside an imaginary cone projected by the paired text embedding $\mathbf{x}$, and is implemented as the difference of exterior angle $\angle O \mathbf{x} \mathbf{y}$ and half aperture of the cone. Loss is zero if the image embedding is already inside the cone (left quadrant).
  • Figure 4: Distribution of embedding distances from [ROOT]: We embed all 12M training images and text using trained MERU and CLIP. Note that precise distance is not necessary for this analysis, so we compute simple monotonic transformations of distances, $d(\mathbf{z})$. MERU embeds text closer to [ROOT] than images.
  • Figure 5: Image traversals with MERU and CLIP. We perform text retrieval at multiple steps while traversing from an image embedding to [ROOT]. Overall, CLIP retrieves fewer textual concepts (top row), but in some cases it reveals a coarse hierarchy (bottom row). MERU captures hierarchy in significantly greater detail. We observe that: (1) Text becomes more generic as we move towards [ROOT], e.g., white horse $\rightarrow$ equestrian and retro photo camera $\rightarrow$ vintage. (2) MERU has higher recall of concepts than CLIP, like the words in the bottom row: homemade, city, monument. (3) MERU also shows systematic text $\rightarrow$ image entailment, e.g., day entails many images captured in daylight.
  • ...and 19 more figures
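
The entailment loss described in the Figure 3 caption can be sketched as follows. This is a hedged NumPy illustration using the usual Lorentz-model entailment-cone formulas, not the authors' implementation. The function names, the boundary constant `K = 0.1`, and the clipping epsilon are assumptions that this page does not state. Points are stored by their space components only, with the time component recovered from the hyperboloid constraint for curvature $-c$.

```python
import numpy as np

EPS = 1e-8  # assumed numerical guard, not from the source

def _time(x_space, c):
    """Time component implied by <x, x>_L = -1/c."""
    return np.sqrt(1.0 / c + np.sum(x_space ** 2, axis=-1))

def half_aperture(x_space, c=1.0, K=0.1):
    """Half aperture of the cone at x: arcsin(2K / (sqrt(c) * ||x_space||)).

    K is an assumed constant; the cone narrows as x moves away from the origin.
    """
    norm = np.linalg.norm(x_space, axis=-1)
    arg = 2.0 * K / (np.sqrt(c) * norm + EPS)
    return np.arcsin(np.clip(arg, -1.0 + EPS, 1.0 - EPS))

def exterior_angle(x_space, y_space, c=1.0):
    """Exterior angle at x between the ray from the origin through x and the geodesic to y."""
    x_t, y_t = _time(x_space, c), _time(y_space, c)
    # c * <x, y>_L, which is <= -1 for points on the hyperboloid
    c_xy = c * (np.sum(x_space * y_space, axis=-1) - x_t * y_t)
    num = y_t + x_t * c_xy
    den = (np.linalg.norm(x_space, axis=-1)
           * np.sqrt(np.clip(c_xy ** 2 - 1.0, EPS, None)))
    return np.arccos(np.clip(num / den, -1.0 + EPS, 1.0 - EPS))

def entailment_loss(text_space, image_space, c=1.0, K=0.1):
    """max(0, exterior angle - half aperture); zero when the image lies inside the text's cone."""
    return np.maximum(
        0.0,
        exterior_angle(text_space, image_space, c) - half_aperture(text_space, c, K),
    )

if __name__ == "__main__":
    text = np.array([1.0, 0.0])
    inside = np.array([2.0, 0.0])   # farther from the origin along the same ray
    outside = np.array([0.5, 0.0])  # closer to the origin than the text
    print(entailment_loss(text, inside), entailment_loss(text, outside))
```

In this sketch an image embedding that sits farther from the origin along the same ray as its text incurs no loss, while one that sits between the origin and the text is penalized, which matches the "text closer to [ROOT] than images" behavior described in the captions above.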