Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval

Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, Thalaiyasingam Ajanthan

TL;DR

This work tackles the challenge of learning visual hierarchies by embedding multi-level relationships in hyperbolic space $\mathbb{H}^d$ without explicit hierarchical labels. It introduces a hyperbolic angle-based entailment loss $L_{angle}$ that enforces asymmetric parent–child entailment across within- and cross-image pairs, leveraging the Lorentz model and tangent-space maps. The authors additionally propose an optimal transport-based metric (1-D Wasserstein distance) to evaluate hierarchical retrieval and demonstrate strong improvements on a newly constructed HierOpenImages dataset, with notable generalization to out-of-domain LVIS, VOC, and COCO data. Overall, the approach yields semantic-structural representations that surpass standard visual-similarity objectives for hierarchical image retrieval and scene understanding.
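As a rough illustration of the ingredients named above (Lorentz model, tangent-space map, entailment angle), the sketch below lifts Euclidean encoder outputs onto the hyperboloid with the exponential map at the origin and computes an exterior angle between a parent and a child embedding. It assumes a fixed curvature parameter `c` and a commonly used hyperboloid entailment-cone formulation; the function names are hypothetical, and the exact form of $L_{angle}$ in the paper may differ.

```python
import numpy as np

def exp_map_origin(v, c=1.0):
    """Map tangent vectors at the hyperboloid origin onto the Lorentz model.

    v: (..., d) Euclidean vectors (e.g. encoder outputs).
    Returns (..., d + 1) points with the time component first.
    """
    v = np.asarray(v, dtype=np.float64)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scaled = np.sqrt(c) * norm
    # sinh(s) / s tends to 1 as s -> 0, so guard the division near zero.
    factor = np.where(scaled > 1e-8, np.sinh(scaled) / np.maximum(scaled, 1e-8), 1.0)
    space = factor * v
    # The time component follows from the constraint <x, x>_L = -1/c.
    time = np.sqrt(1.0 / c + np.sum(space**2, axis=-1, keepdims=True))
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: space part minus time part."""
    return np.sum(x[..., 1:] * y[..., 1:], axis=-1) - x[..., 0] * y[..., 0]

def exterior_angle(x, y, c=1.0, eps=1e-8):
    """Exterior angle at x of the geodesic triangle (origin, x, y).

    A small angle means y lies roughly "behind" x as seen from the origin,
    which is the asymmetric parent -> child direction (assumed formulation).
    """
    inner = c * lorentz_inner(x, y)
    num = y[..., 0] + x[..., 0] * inner
    den = np.linalg.norm(x[..., 1:], axis=-1) * np.sqrt(np.clip(inner**2 - 1.0, eps, None))
    return np.arccos(np.clip(num / (den + eps), -1.0, 1.0))

# Hypothetical parent/child features: the child lies further out along the same ray.
parent = exp_map_origin(np.array([0.3, 0.4]))
child = exp_map_origin(np.array([0.6, 0.8]))
angle_parent_child = exterior_angle(parent, child)  # close to 0
angle_child_parent = exterior_angle(child, parent)  # close to pi (asymmetric)
```

The asymmetry between the two angles is what makes such a quantity usable for entailment rather than plain similarity: swapping parent and child changes the value.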

Abstract

Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies is relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined multi-level complex visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, first, we define a part-based image hierarchy using object-level annotations within and across images. Then, we introduce an approach to enforce the hierarchy using contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics to effectively measure hierarchical image retrieval. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that transcends mere visual similarity. Experiments in part-based image retrieval show significant improvements in hierarchical retrieval tasks, demonstrating the capability of our model in capturing visual hierarchies.
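The TL;DR names a 1-D Wasserstein distance as the optimal transport-based evaluation metric. The sketch below shows how that distance can be computed between two empirical 1-D distributions by integrating the absolute difference of their CDFs. Which quantities the paper actually compares is not stated in this summary, so the inputs here (hierarchy levels of retrieved items versus an ideal ordering) are purely hypothetical.

```python
import numpy as np

def wasserstein_1d(u, v):
    """1-D Wasserstein-1 distance between two empirical distributions.

    u, v: 1-D arrays of samples (sizes may differ); every sample has equal weight.
    """
    u = np.sort(np.asarray(u, dtype=np.float64))
    v = np.sort(np.asarray(v, dtype=np.float64))
    all_vals = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_vals)
    # Empirical CDFs evaluated on each interval between consecutive sample values.
    u_cdf = np.searchsorted(u, all_vals[:-1], side="right") / u.size
    v_cdf = np.searchsorted(v, all_vals[:-1], side="right") / v.size
    return float(np.sum(np.abs(u_cdf - v_cdf) * deltas))

# Hypothetical example: hierarchy levels of four retrieved items vs. an ideal result.
retrieved_levels = [1, 1, 2, 3]
ideal_levels = [1, 2, 2, 3]
score = wasserstein_1d(retrieved_levels, ideal_levels)
```

A score of zero means the two distributions coincide; larger values indicate that more mass has to be moved, and moved further, to turn one distribution into the other.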

Paper Structure

This paper contains 36 sections, 12 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: An illustration of part-based image hierarchy organized in hyperbolic space. At the highest level, we see the urban environment, composed of buildings, streets, and sky. Zooming in, we find the building category, which further divides into skyscrapers, mid-rise structures, and more. Each of them has its own visual elements, which in turn can be decomposed into sub-elements.
  • Figure 2: An illustrative example of image hierarchies. a) Image $I$ with object-level bounding boxes. Each bounding box is entailed by $I$. b) Hierarchies created via bounding box-to-bounding box entailment within $I$ (larger bounding boxes entail smaller ones). c) Cross-image hierarchy created by sampling $N$ bounding boxes with corresponding object classes from other images, which are then entailed by $I$. Details of cross-image sampling are given in the Part-Based Image Hierarchy section.
  • Figure 3: Learning multi-level hierarchies via contrastive entailment angle loss. Our model first encodes parent-to-child pairs into embeddings with exponential mapping, then maximizes $\beta_1$ and $\alpha_2$ using our contrastive entailment angle loss in hyperbolic space.
  • Figure 4: Precision-Recall curves of CLIP ViT models on hierarchical retrieval. Dotted lines show models trained only on hierarchical entailment data within the same images; solid lines represent models trained with additional cross-image scene-to-object samples. Angle or cosine similarity threshold values are marked by text.
  • Figure 5: Example of parent-to-child retrieval using CLIP ViT and our CLIP-hyp$^\dagger$ model. Results are ordered by ascending norms. Our model retrieves images matching the predefined scene-object-part hierarchy, placing high-level objects near the origin (e.g., harbor $\rightarrow$ boat parts), and grouping semantically related but visually distinct objects (e.g., microwave oven and kitchen hood).
  • ...and 6 more figures