Hyperbolic Image-Text Representations

Karan Desai; Maximilian Nickel; Tanmay Rajpurohit; Justin Johnson; Ramakrishna Vedantam

TL;DR

MERU introduces hyperbolic image-text representations via the Lorentz hyperboloid to capture the natural visual-semantic hierarchy. By lifting Euclidean embeddings onto $\mathcal{L}^n$ with the exponential map and coupling a Lorentzian contrastive loss with an entailment loss, MERU induces a structured, interpretable space where text lies closer to the root than images. Trained at scale (~12M image-text pairs), MERU is competitive with CLIP on zero-shot image classification and retrieval, with added benefits for small embedding dimensions and on-device deployment. Qualitative analysis reveals finer-grained hierarchical organization and explicit root-based entailment, while ablations highlight the necessity of learnable curvature and the entailment objective for robust structure.
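The lifting step and the Lorentzian distance mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the common convention in which a point on the hyperboloid of curvature $-c$ is stored by its space components only, with the time component recovered as $\sqrt{1/c + \lVert \mathbf{x}_{space} \rVert^2}$. The function names (`exp_map_origin`, `lorentz_distance`, and so on) are hypothetical and chosen for illustration.

```python
import numpy as np

def exp_map_origin(v_space, c=1.0):
    """Lift Euclidean vectors onto the hyperboloid of curvature -c.

    Applies the exponential map at the hyperboloid origin and returns
    only the space components of the lifted points (assumed convention).
    """
    v_space = np.asarray(v_space, dtype=np.float64)
    scaled = np.sqrt(c) * np.linalg.norm(v_space, axis=-1, keepdims=True)
    # sinh(t) / t tends to 1 as t -> 0, so guard against division by zero.
    safe = np.where(scaled > 1e-8, scaled, 1.0)
    factor = np.where(scaled > 1e-8, np.sinh(safe) / safe, 1.0)
    return factor * v_space

def time_component(x_space, c=1.0):
    """Recover the time component from the hyperboloid constraint."""
    return np.sqrt(1.0 / c + np.sum(np.square(x_space), axis=-1))

def lorentz_inner(x_space, y_space, c=1.0):
    """Lorentzian inner product: space dot product minus time product."""
    x_space = np.asarray(x_space, dtype=np.float64)
    y_space = np.asarray(y_space, dtype=np.float64)
    space = np.sum(x_space * y_space, axis=-1)
    return space - time_component(x_space, c) * time_component(y_space, c)

def lorentz_distance(x_space, y_space, c=1.0):
    """Geodesic distance on the hyperboloid of curvature -c."""
    # Clamp to the domain of arccosh to absorb rounding error.
    arg = np.maximum(-c * lorentz_inner(x_space, y_space, c), 1.0)
    return np.arccosh(arg) / np.sqrt(c)
```

A contrastive loss in this setting would use the negative of `lorentz_distance` as the similarity score in place of the cosine similarity used on a hypersphere. One property worth noting is that the distance from the origin to a lifted point equals the Euclidean norm of the vector before lifting.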

Abstract

Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru

Paper Structure (68 sections, 22 equations, 24 figures, 8 tables)

Introduction
Visual-semantic hierarchy.
Vision-language representation learning.
Hyperbolic representations with MERU.
Preliminaries
Riemannian manifolds
Lorentz model of hyperbolic geometry
Definition.
Geodesics.
Tangent space.
Exponential and logarithmic maps.
Approach
Lifting embeddings onto the hyperboloid.
Preventing numerical overflow.
Learning structured embeddings.
...and 53 more sections

Figures (24)

Figure 1: Hyperbolic image-text representations. Left: Images and text depict concepts and can be jointly viewed in a visual-semantic hierarchy, wherein text 'exhausted doggo' is more generic than an image (which might have more details like a cat or snow). Our method MERU embeds images and text in a hyperbolic space that is well-suited to embed tree-like data. Right: Representation manifolds of CLIP (hypersphere) and MERU (hyperboloid) illustrated in 3D. MERU assumes the origin to represent the most generic concept, and embeds text closer to the origin than images.
Figure 2: MERU model design: MERU comprises similar architectural components as standard image-text contrastive models like CLIP. While CLIP projects the embeddings to a unit hypersphere, MERU lifts them onto the Lorentz hyperboloid using the exponential map. The contrastive loss uses the negative of Lorentzian distance as a similarity metric, and a special entailment loss enforces 'text entails image' partial order in the representation space.
Figure 3: Entailment loss (illustrated for $\mathcal{L}^2$): This loss pushes image embedding $\mathbf{y}$ inside an imaginary cone projected by the paired text embedding $\mathbf{x}$, and is implemented as the difference of exterior angle $\angle O \mathbf{x} \mathbf{y}$ and half aperture of the cone. Loss is zero if the image embedding is already inside the cone (left quadrant).
Figure 4: Distribution of embedding distances from [ROOT]: We embed all 12M training images and text using trained MERU and CLIP. Note that precise distance is not necessary for this analysis, so we compute simple monotonic transformations of distances, $d(\mathbf{z})$. MERU embeds text closer to [ROOT] than images.
Figure 5: Image traversals with MERU and CLIP. We perform text retrieval at multiple steps while traversing from an image embedding to [ROOT]. Overall, CLIP retrieves fewer textual concepts (top row), but in some cases it reveals a coarse hierarchy (bottom row). MERU captures hierarchy in significantly greater detail. We observe that: (1) Text becomes more generic as we move towards [ROOT], e.g., white horse $\rightarrow$ equestrian and retro photo camera $\rightarrow$ vintage. (2) MERU has higher recall of concepts than CLIP, such as the words in the bottom row: homemade, city, monument. (3) MERU also shows systematic text $\rightarrow$ image entailment, e.g., day entails many images captured in daylight.
...and 19 more figures
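
The entailment loss described in the Figure 3 caption (exterior angle $\angle O \mathbf{x} \mathbf{y}$ minus the half aperture of the cone at $\mathbf{x}$, clamped at zero) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' code. The exterior angle follows from the hyperbolic law of cosines. The half-aperture formula $\arcsin\left(2K / (\sqrt{c}\,\lVert \mathbf{x}_{space} \rVert)\right)$ and the constant $K = 0.1$ are assumed from the usual entailment-cone formulation and are not stated in the captions above. Points are stored by their space components only, and all function names are hypothetical.

```python
import numpy as np

def time_component(x_space, c=1.0):
    """Time component implied by the hyperboloid constraint."""
    return np.sqrt(1.0 / c + np.sum(np.square(x_space), axis=-1))

def half_aperture(x_space, c=1.0, K=0.1, eps=1e-8):
    """Half aperture of the cone at x (assumed formula, K is a constant)."""
    norm = np.linalg.norm(x_space, axis=-1)
    arg = 2.0 * K / (np.sqrt(c) * norm + eps)
    return np.arcsin(np.clip(arg, -1.0 + eps, 1.0 - eps))

def exterior_angle(x_space, y_space, c=1.0, eps=1e-8):
    """Exterior angle at x of the hyperbolic triangle (origin, x, y)."""
    x_time = time_component(x_space, c)
    y_time = time_component(y_space, c)
    # c times the Lorentzian inner product of x and y.
    c_xy = c * (np.sum(x_space * y_space, axis=-1) - x_time * y_time)
    numer = y_time + x_time * c_xy
    denom = np.linalg.norm(x_space, axis=-1) * np.sqrt(
        np.maximum(np.square(c_xy) - 1.0, eps)
    )
    # Clip to keep arccos inside its domain despite rounding error.
    return np.arccos(np.clip(numer / (denom + eps), -1.0 + eps, 1.0 - eps))

def entailment_loss(text_space, image_space, c=1.0, K=0.1):
    """Zero when the image lies inside the cone of its paired text."""
    text_space = np.asarray(text_space, dtype=np.float64)
    image_space = np.asarray(image_space, dtype=np.float64)
    gap = exterior_angle(text_space, image_space, c) - half_aperture(
        text_space, c, K
    )
    return np.maximum(0.0, gap)
```

As a sanity check on the geometry: an image placed farther from the origin along the same geodesic as its text has an exterior angle near zero and incurs no loss, while an image on the opposite side of the origin has an exterior angle near $\pi$ and is penalized.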
