Hyperbolic Image-Text Representations
Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Ramakrishna Vedantam
TL;DR
MERU introduces hyperbolic image-text representations on the Lorentz hyperboloid to capture the natural visual-semantic hierarchy. By lifting Euclidean embeddings onto $\mathcal{L}^n$ with the exponential map and coupling a Lorentzian contrastive loss with an entailment loss, MERU induces a structured, interpretable space where text lies closer to the root than images. Trained at scale (~12M image-text pairs), MERU is competitive with CLIP on zero-shot image classification and retrieval, with added benefits at small embedding dimensions and for on-device deployment. Qualitative analysis reveals finer-grained hierarchical organization and explicit root-based entailment, while ablations highlight the necessity of learnable curvature and the entailment objective for robust structure.
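The lifting step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the common parametrization of the Lorentz model with curvature $-c$ ($c > 0$), in which a point is stored by its space components and the time component is recovered from the hyperboloid constraint $\langle x, x \rangle_{\mathcal{L}} = -1/c$. The function names (`exp_map_origin`, `time_component`, `lorentz_inner`, `lorentz_distance`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def exp_map_origin(v, c=1.0):
    """Lift Euclidean vectors v (..., n) onto the hyperboloid of curvature -c.

    Returns only the space components; the time component follows from
    the hyperboloid constraint (see time_component).
    """
    v = np.asarray(v, dtype=np.float64)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.sqrt(c) * norm
    # sinh(t) / t tends to 1 as t tends to 0, so guard the division.
    safe = np.where(scaled > 1e-8, scaled, 1.0)
    factor = np.where(scaled > 1e-8, np.sinh(safe) / safe, 1.0)
    return factor * v


def time_component(x_space, c=1.0):
    """Time component implied by <x, x>_L = -1/c."""
    return np.sqrt(1.0 / c + np.sum(x_space ** 2, axis=-1))


def lorentz_inner(x_space, y_space, c=1.0):
    """Lorentzian inner product: space dot product minus product of times."""
    x_time = time_component(x_space, c)
    y_time = time_component(y_space, c)
    return np.sum(x_space * y_space, axis=-1) - x_time * y_time


def lorentz_distance(x_space, y_space, c=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clip guards against arguments slightly below 1 from rounding error.
    arg = np.clip(-c * lorentz_inner(x_space, y_space, c), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)
```

Under these assumptions, the geodesic distance from the origin of the hyperboloid (the lifted zero vector) to `exp_map_origin(v, c)` equals the Euclidean norm of `v`, which gives a quick sanity check on the lifting.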
Abstract
Visual and linguistic concepts naturally organize themselves in a hierarchy, where a textual concept "dog" entails all images that contain dogs. Despite being intuitive, current large-scale vision and language models such as CLIP do not explicitly capture such hierarchy. We propose MERU, a contrastive model that yields hyperbolic representations of images and text. Hyperbolic spaces have suitable geometric properties to embed tree-like data, so MERU can better capture the underlying hierarchy in image-text datasets. Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our code and models are available at https://www.github.com/facebookresearch/meru
